Clustering uncertain data has emerged as a challenging task
in uncertain data management and mining. Thanks to a 
computational complexity advantage over other clustering 
paradigms, partitional clustering has been particularly stud- 
ied and a number of algorithms have been developed. While 
existing proposals differ mainly in the notions of cluster cen- 
troid and clustering objective function, little attention has 
been given to an analysis of their characteristics and lim- 
its. In this work, we theoretically investigate major existing 
methods of partitional clustering, and alternatively propose 
a well-founded approach to clustering uncertain data based 
on a novel notion of cluster centroid. A cluster centroid is 
seen as an uncertain object defined in terms of a random 
variable whose realizations are derived based on all deter- 
ministic representations of the objects to be clustered. As 
demonstrated theoretically and experimentally, this allows 
for better representing a cluster of uncertain objects, thus 
supporting a consistently improved clustering performance 
while maintaining comparable efficiency with existing parti- 
tional clustering algorithms. 

1. INTRODUCTION 

Uncertainty in data naturally arises from a variety of real- 
world phenomena, such as implicit randomness in a process 
of data generation/acquisition, imprecision in physical
measurements, and data staling [1]. For instance, sensor
measurements may be imprecise to a certain degree due to the
presence of various noisy factors (e.g., signal noise, instru- 
mental errors, wireless transmission) [6]. Another example 
is given by moving objects [19], which continuously change 
their location so that the exact positional information at a 
given time can only be estimated when there is a certain la- 
tency in communicating the position (i.e., data is inherently 
obsolete). The biomedical research domain abounds in data
inherently affected by uncertainty. As an example, in the
context of gene expression microarray data, handling the
so-called probe-level uncertainty represents a key aspect that
allows for a more expressive data representation and a more
accurate processing [15]. Further examples of uncertain data 
come from distributed applications, privacy preserving data 
mining, and forecasting or other statistical techniques used 
to generate data attributes [1]. 

In general, uncertainty can be considered at table, tuple 
or attribute level, and is formally specified by fuzzy models, 
evidence-oriented models, or probabilistic models [18]. This
work focuses on data containing attribute-level uncertainty, 
which is probabilistically modeled. In particular, we are in- 
terested in probabilistic representations that use probability 
distributions to describe the likelihood that any object ap- 
pears at each position in a multidimensional space [4,12-14]. 
We hereinafter refer to data objects described in terms of 
probability distributions as uncertain objects. 

Given a set of data objects, the problem of clustering is to 
discover a number of homogeneous subsets of objects, called 
clusters, which are well-separated from each other [10]. In 
the context of uncertain data, clustering uncertain objects 
has recently emerged as a very challenging problem aimed 
at extending the traditional clustering methods (originally 
conceived to work on deterministic objects) to deal with 
objects represented in terms of probability distributions. 

Partitional approaches to clustering of uncertain objects 
include the fastest algorithms so far defined, namely UK- 
means [4,14], which is an adaptation of the popular K-means 
clustering algorithm to the context of uncertain objects, and 
MMVar [8], which exploits a criterion based on the
minimization of the variance of cluster mixture models. Both
algorithms involve formulations based on two main notions:
cluster centroid, which is an object/point that summarizes
the information of a given cluster, and cluster compactness,
which is based on the assessment of proximity between the
uncertain objects assigned to the cluster and the correspond- 
ing centroid. These notions represent a key aspect for the 
efficiency of partitional methods; in fact, different formula- 
tions based on pairwise comparisons between the objects in 
a cluster (e.g., [7]) are inevitably less efficient. 

Though quite efficient, both UK-means and MMVar for- 
mulations however suffer from some critical weaknesses that 
limit the effectiveness of such algorithms. In UK-means the 
centroid of a cluster of uncertain objects is reduced to be a 
deterministic (i.e., non-uncertain) point that is simply de- 
fined as the average of the expected values of the objects 
belonging to that cluster. As a consequence, the notion 
of within-cluster uncertainty in UK-means discards any in- 
formation about the variance of the cluster members, and 
hence of the cluster itself. It is quite intuitive that, equipped
with such a simplistic definition of cluster centroid, the
algorithm can easily fail in distinguishing clusters with the
same central tendency but different variance. For instance,
consider the situation in Figure 1: since the group of
objects in Figure 1-(a) has the same (sum of) expected
values as the group in Figure 1-(b), taking into account the
variance of these groups is essential to recognize the one
with lower variance (Figure 1-(a)) as more compact than the
other with higher variance (Figure 1-(b)). As we formally
show in Section 3, an analogous issue affects MMVar too.
On the other hand, however, considering only the variance
to determine cluster compactness may lead to wrong results
as well. As an example, consider the objects in Figure 2-(a),
which have individual variances smaller than the objects in
Figure 2-(b): the latter set of objects clearly represents a
cluster more compact than the set in Figure 2-(a); however,
a criterion based only on the minimization of the variance of
the uncertain objects would mistakenly recognize the group
in Figure 2-(a) as better than the group in Figure 2-(b).

Figure 1: Uncertain objects with same central tendency:
(a) lower-variance, more compact cluster, and
(b) higher-variance, less compact cluster

Figure 2: Uncertain objects with different central tendency:
(a) lower-variance, less compact cluster, and
(b) higher-variance, more compact cluster

In light of the above remarks, a major goal of this work is 
to address the problem of partitional clustering of uncertain 
objects in order to improve the accuracy of existing parti- 
tional clustering methods, while maintaining high efficiency. 
Our contributions can be summarized as follows. We pro- 
vide a deep insight into the UK-means and MMVar formula- 
tions, and formally describe the aforementioned issues. We 
prove that the UK-means and MMVar objective functions 
differ only by a constant factor, and that no improvement 
is obtained by employing an objective function that would 
mix the UK-means cluster compactness criterion and the 
MMVar cluster centroid definition. Therefore, we propose a 
formulation to the problem of clustering uncertain objects 
based on a novel notion of cluster centroid. The proposed 
centroid, called U-centroid, is an uncertain object defined in 
terms of a random variable whose realizations correspond 
to all possible deterministic representations deriving from 
the (deterministic) representations of the uncertain objects
to be clustered. We derive the analytical expressions of the
domain region and probability distribution of the proposed
U-centroid. However, as these expressions are in general
not analytically computable, we define a cluster compactness
criterion that can be efficiently computed in closed form.
In particular, we first prove that applying the MMVar criterion
to the proposed U-centroid is not suited for measuring cluster
compactness: it reduces to considering only the variances of
the individual uncertain objects, which is not sound, as
illustrated in Figure 2. We then show that minimizing the sum
of the distances between each object in a cluster and the
corresponding U-centroid fulfils all desired requirements. We
propose a local search-based heuristic algorithm, called
U-Centroid-based Partitional Clustering (UCPC), which searches
for a local minimum of the objective function of the proposed
formulation by suitably exploiting the aforementioned
closed-form expression of cluster compactness, thus meeting
both effectiveness and efficiency requirements.

We have conducted an extensive experimental evaluation 
on several datasets, including datasets with uncertainty
generated synthetically as well as real-world collections in which
uncertainty is inherently present. The proposed UCPC algo- 
rithm outperforms state-of-the-art partitional, density-based 
and hierarchical algorithms in terms of clustering accuracy 
(according to external and internal cluster validity criteria). 
Moreover, from an efficiency viewpoint, UCPC is compa- 
rable to the fastest existing partitional methods, i.e., UK- 
means and MMVar, and can also perform better than prun- 
ing-based variants of the basic UK-means algorithm. 

The rest of the paper is organized as follows. Section 2 
provides background notions and definitions that will be 
used in this work. Section 3 theoretically shows the
weaknesses of UK-means and MMVar. Section 4 discusses our
proposal in detail. Section 5 presents experimental settings
and results. Section 6 concludes the paper. Finally, the
Appendix provides proof sketches of the main results.

2. BACKGROUND 

Existing research on clustering uncertain objects has fo- 
cused mainly on adapting traditional clustering algorithms 
to handle uncertainty. Adaptations have been made for each 
of the major categories of clustering methods, namely
partitional (UK-means [4,14], UK-medoids [7], and MMVar [8]),
hierarchical (U-AHC [9]), and density-based (FDBSCAN
[12], FOPTICS [13]). Note that, due to the well-known
computational complexity advantages w.r.t. the other
clustering paradigms, the partitional clustering methods have
attracted more attention, and a number of works have also
been devoted to improving the runtime performance of
K-means-like clustering methods [5, 11, 14, 16, 17].

In the following, we provide formal details on how un- 
certain objects are represented, and key notions in the two 
methods mainly under consideration in this work, namely 
UK-means and MMVar. 

2.1 Modeling Uncertainty

Uncertain objects are typically represented according to 
multivariate uncertainty models [8], which involve multidi- 
mensional domain regions and multivariate pdfs. 

Definition 1. A multivariate uncertain object $o$ is a pair
$(\mathcal{R}, f)$, where $\mathcal{R} \subseteq \mathbb{R}^m$ is the m-dimensional domain region
of $o$ and $f : \mathbb{R}^m \rightarrow \mathbb{R}_0^+$ is the probability density function of
$o$ at each point $\vec{x} \in \mathbb{R}^m$, such that:

$$ f(\vec{x}) > 0,\ \forall \vec{x} \in \mathcal{R} \quad \text{and} \quad f(\vec{x}) = 0,\ \forall \vec{x} \in \mathbb{R}^m \setminus \mathcal{R} \tag{1} $$

For arbitrary pdfs of uncertain objects, we assume statistical
independence between the actual deterministic
representations $\vec{x}, \vec{x}' \in \mathbb{R}^m$ of any two given uncertain objects
$o = (\mathcal{R}, f)$, $o' = (\mathcal{R}', f')$. Formally, $\forall \vec{x}, \vec{x}' \in \mathbb{R}^m$, it can be
assumed that:

$$ \Pr(o = \vec{x},\, o' = \vec{x}') = \Pr(o = \vec{x}) \Pr(o' = \vec{x}') = f(\vec{x})\, f'(\vec{x}') $$

where "$o = \vec{x}$" denotes the event "the actual representation
of the uncertain object $o$ corresponds to the point $\vec{x} \in \mathbb{R}^m$".

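The expected value ($\vec{\mu}$), second order moment ($\vec{\mu}_2$), and
variance ($\vec{\sigma}^2$) vectors of an uncertain object $o = (\mathcal{R}, f)$ are
defined as follows:

$$ \vec{\mu}(o) = \int_{\vec{x} \in \mathcal{R}} \vec{x}\, f(\vec{x})\, d\vec{x} \qquad \vec{\mu}_2(o) = \int_{\vec{x} \in \mathcal{R}} \vec{x}^2 f(\vec{x})\, d\vec{x} \tag{2} $$

$$ \vec{\sigma}^2(o) = \int_{\vec{x} \in \mathcal{R}} (\vec{x} - \vec{\mu}(o))^2 f(\vec{x})\, d\vec{x} = \vec{\mu}_2(o) - \vec{\mu}^2(o) \tag{3} $$

The j-th component ($j \in [1..m]$) of the $\vec{\mu}$, $\vec{\mu}_2$ and $\vec{\sigma}^2$ vectors
is as follows:

$$ \mu_j(o) = \int_{\vec{x} \in \mathcal{R}} x_j\, f(\vec{x})\, d\vec{x} \qquad (\mu_2)_j(o) = \int_{\vec{x} \in \mathcal{R}} x_j^2\, f(\vec{x})\, d\vec{x} \tag{4} $$

$$ (\sigma^2)_j(o) = \int_{\vec{x} \in \mathcal{R}} (x_j - \mu_j(o))^2 f(\vec{x})\, d\vec{x} = (\mu_2)_j(o) - \mu_j^2(o) \tag{5} $$

Moreover, given a vector $\vec{\sigma}^2$ of variances, the "global" variance
expressed in terms of a single numerical value is defined
as the sum of variances along each dimension:

$$ \sigma^2(o) = \|\vec{\sigma}^2(o)\|_1 = \sum_{j=1}^{m} (\sigma^2)_j(o) = \int_{\vec{x} \in \mathcal{R}} \|\vec{x} - \vec{\mu}(o)\|^2 f(\vec{x})\, d\vec{x} \tag{6} $$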
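As a concrete illustration of (2)-(6), the following Python sketch computes the moment vectors of an uncertain object whose pdf is assumed, purely for the sake of the example, to be uniform over an axis-aligned box. The function name and the uniform assumption are not part of the original formulation.

```python
def uniform_box_moments(lower, upper):
    """Moments of an object ASSUMED uniform on the box [lower, upper] (illustrative only)."""
    # Expected value per dimension, cf. Eq. (4)
    mu = [(l + u) / 2.0 for l, u in zip(lower, upper)]
    # Second-order moment per dimension, cf. Eq. (4)
    mu2 = [(l * l + l * u + u * u) / 3.0 for l, u in zip(lower, upper)]
    # Variance per dimension as mu2 - mu^2, cf. Eq. (5)
    var = [m2 - m * m for m, m2 in zip(mu, mu2)]
    # "Global" variance: sum of the per-dimension variances, cf. Eq. (6)
    return mu, mu2, var, sum(var)


mu, mu2, var, global_var = uniform_box_moments([0.0, 1.0], [2.0, 4.0])
```

Any other pdf family could be plugged in, as long as its first and second moments are available in closed form or can be estimated from samples.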



2.2 UK-means 

Given a cluster $C$ of uncertain objects, the centroid of $C$
according to the basic UK-means algorithm [4] is a
deterministic (i.e., non-uncertain) point $C_{UK}$ defined by averaging
over the expected values of the objects within $C$:

$$ C_{UK} = \frac{1}{|C|} \sum_{o \in C} \vec{\mu}(o) \tag{7} $$

Given a candidate clustering $\mathcal{C}$, the basic UK-means minimizes
the objective function $\sum_{C \in \mathcal{C}} J_{bUK}(C)$, with
$J_{bUK}(C) = \sum_{o \in C} ED_d(o, C_{UK})$. $ED_d(\cdot,\cdot)$ denotes the expected distance
between an uncertain object $o = (\mathcal{R}, f)$ and a point, and is
defined as $ED_d(o, \vec{y}) = \int_{\vec{x} \in \mathcal{R}} d(\vec{x}, \vec{y})\, f(\vec{x})\, d\vec{x}$, where $d(\cdot,\cdot)$ is
any metric measuring the distance between two m-dimensional
points. Note that the computation of the integral in $ED_d$
represents a major bottleneck in the execution of the basic
UK-means, since in the general case it cannot be computed
in closed form but requires an approximation based on a
(typically large) set of statistical samples drawn from
the pdfs of the objects. The cost of this integral approximation
is not negligible, and hence the overall complexity of
the basic UK-means is $O(I\,S\,k\,n\,m)$, where n is the size of
the input set of uncertain objects, m is the dimensionality
of the uncertain objects, k is the desired number of clusters,
S is the cardinality of the sample set, and I is the number
of iterations for the algorithm convergence.

To speed up the execution of the basic UK-means, most
work has been done on developing pruning techniques, whose 
general goal is to avoid the computation of redundant ex- 
pected distances between uncertain objects and (candidate) 
cluster centroids. Major contributions in this regard are 
proposed in [16] (MinMax-BB algorithm), [11] (VDBiP al- 
gorithm), and [17], where the cluster-shift technique is in- 
troduced as a general method to further tighten bounds ob- 
tained by existing pruning strategies. However, the worst- 
case complexity of the basic UK-means is not reduced by 
the pruning techniques. For this purpose, [14] proposes a 
different approach based on a modification of the formula 
of the expected distance: denoting simply by $ED(\cdot,\cdot)$ the
expected distance $ED_d$ when the metric $d$ is the squared
Euclidean norm $\|\cdot\|^2$, [14] shows that:

$$ ED(o, C_{UK}) = \int_{\vec{x} \in \mathcal{R}} \|\vec{x} - C_{UK}\|^2 f(\vec{x})\, d\vec{x} = ED(o, \vec{\mu}(o)) + \|C_{UK} - \vec{\mu}(o)\|^2 \tag{8} $$



Thus, the expected distance between any uncertain object
$o$ and centroid $C_{UK}$ is equal to the sum of two terms: the
first, which is the most expensive one, is given by the
expected distance between $o$ and its expected value $\vec{\mu}(o)$, while
the second one, which is efficiently computable in $O(m)$, is
equal to the (squared) Euclidean distance between the
centroid $C_{UK}$ and $\vec{\mu}(o)$. Since the first term does not change
during the execution of the algorithm (and hence it can be
precomputed for each input object), the algorithm in [14]
has an "online" complexity of $O(I\,k\,n\,m)$. Unless otherwise
specified, throughout the rest of this paper we refer to
the algorithm in [14] as UK-means and to its cluster
compactness criterion as

$$ J_{UK}(C) = \sum_{o \in C} ED(o, C_{UK}) \tag{9} $$
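To make the contrast between the sample-based integral of the basic UK-means and the decomposition in (8) concrete, the following Python sketch compares the two for a single object. It assumes, only for illustration, an object with a uniform pdf over a box; it also uses the fact that, by (6), $ED(o, \vec{\mu}(o))$ coincides with the global variance $\sigma^2(o)$. All function names are hypothetical.

```python
import random


def expected_sq_distance_mc(lower, upper, point, n_samples, rng):
    """Sample-based estimate of ED(o, point) under the squared Euclidean distance."""
    total = 0.0
    for _ in range(n_samples):
        # Draw one deterministic representation of the (assumed uniform) object
        x = [rng.uniform(l, u) for l, u in zip(lower, upper)]
        total += sum((xi - pi) ** 2 for xi, pi in zip(x, point))
    return total / n_samples


def expected_sq_distance_closed(lower, upper, point):
    """Eq. (8): ED(o, point) = ED(o, mu(o)) + ||point - mu(o)||^2."""
    mu = [(l + u) / 2.0 for l, u in zip(lower, upper)]
    # For a uniform box, ED(o, mu(o)) is the sum of (u - l)^2 / 12 over dimensions
    first_term = sum((u - l) ** 2 / 12.0 for l, u in zip(lower, upper))
    second_term = sum((p - m) ** 2 for p, m in zip(point, mu))
    return first_term + second_term


rng = random.Random(0)
lower, upper, c = [0.0, 1.0], [2.0, 4.0], [3.0, 0.0]
approx = expected_sq_distance_mc(lower, upper, c, 20000, rng)
exact = expected_sq_distance_closed(lower, upper, c)
```

The first term of the closed form depends only on the object, which is why it can be precomputed once, whereas the sample-based estimate has to be recomputed for every candidate centroid.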



2.3 MMVar 

In the MMVar algorithm, the centroid of a cluster $C$ is
defined as an uncertain object $C_{MM}$ that represents a mixture
model of $C$:

$$ C_{MM} = (\mathcal{R}_{MM}, f_{MM}) \tag{10} $$

where $\mathcal{R}_{MM} = \bigcup_{o \in C} \mathcal{R}$, and the pdf $f_{MM}(\vec{x})$ is defined as the
average $|C|^{-1} \sum_{o \in C} f(\vec{x})$ of the pdfs of the objects within $C$.
The cluster compactness criterion $J_{MM}$ employed by MMVar
is simply based on the minimization of the variance of
the cluster centroid:

$$ J_{MM}(C) = \sigma^2(C_{MM}) \tag{11} $$

Analogously to UK-means, the overall objective function to
be minimized for a candidate clustering $\mathcal{C}$ is $\sum_{C \in \mathcal{C}} J_{MM}(C)$,
and the complexity of the algorithm is $O(I\,k\,n\,m)$ [8].

3. COMPARING UK-MEANS AND MMVAR 

UK-means shortcomings. As previously discussed, UK-means
is characterized by a deterministic definition of cluster
centroid, which is simply the average of expected values
of the cluster members (7). This implies that UK-means
does not explicitly take into account the individual
variances of the objects that belong to a cluster. As shown in
the following proposition, a major consequence is that two
clusters can have the same value of objective function $J_{UK}$
regardless of their respective cluster variance (i.e., sum of
variances of the objects that belong to a cluster).

Proposition 1. Given any two clusters $C$ and $C'$ of
uncertain objects, it holds that:

$$ J_{UK}(C) = J_{UK}(C') \;\not\Rightarrow\; \sum_{o \in C} \sigma^2(o) = \sum_{o' \in C'} \sigma^2(o') $$

The above proposition states that the compactness criterion
at the basis of UK-means might not discriminate among
groups of uncertain objects having different cluster
variances. Moreover, it can be straightforwardly derived from
the proposition that

$$ \sum_{o \in C} \sigma^2(o) \neq \sum_{o' \in C'} \sigma^2(o') \;\not\Rightarrow\; J_{UK}(C) \neq J_{UK}(C') $$

i.e., different values of cluster variance for $C$ and $C'$ do not
necessarily force the values $J_{UK}(C)$ and $J_{UK}(C')$ to be different.
These aspects may lead to situations like that already
illustrated in Figure 1, where the clustering algorithm is unable
to recognize a cluster with smaller variance as less uncertain,
hence more compact, than a cluster with higher variance.

Comparing UK-means and MMVar objective functions.

The UK-means weaknesses could in principle be overcome
by the MMVar criterion $J_{MM}$, since MMVar centroids
involve uncertainty in their representation. Unfortunately,
this is not true, as $J_{UK}$ and $J_{MM}$ can be demonstrated to
be very close to each other. In particular, as formally shown
in the following proposition, the UK-means and MMVar
objective functions differ from each other only by a constant
factor. This clearly implies that the aforementioned
UK-means accuracy issues affect MMVar as well, though the
MMVar centroid definition involves uncertainty.

Proposition 2. Let $C$ be a cluster of m-dimensional
uncertain objects, where $o = (\mathcal{R}, f)$, $\forall o \in C$. In reference to
the functions $J_{UK}$ and $J_{MM}$ defined in (9) and (11),
respectively, it holds that $J_{MM}(C) = |C|^{-1} J_{UK}(C)$.
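Proposition 2 can be checked numerically on a toy cluster. The Python sketch below is only an illustration: objects are summarized by their expected-value and variance vectors, and it relies on two standard facts that are assumptions of the sketch rather than statements taken from the text, namely that an equal-weight mixture has moments equal to the averages of the member moments, and that $ED(o, \vec{\mu}(o)) = \sigma^2(o)$. The function names and data are hypothetical.

```python
def j_uk(mus, variances):
    """J_UK of Eq. (9), evaluated through the decomposition of Eq. (8)."""
    n, m = len(mus), len(mus[0])
    centroid = [sum(mu[j] for mu in mus) / n for j in range(m)]  # Eq. (7)
    total = 0.0
    for mu, var in zip(mus, variances):
        total += sum(var) + sum((mu[j] - centroid[j]) ** 2 for j in range(m))
    return total


def j_mm(mus, variances):
    """J_MM of Eq. (11): global variance of the equal-weight mixture of the objects."""
    n, m = len(mus), len(mus[0])
    total = 0.0
    for j in range(m):
        mix_mu = sum(mu[j] for mu in mus) / n
        # Second-order moment of each member is var + mu^2; the mixture averages them
        mix_mu2 = sum(var[j] + mu[j] ** 2 for mu, var in zip(mus, variances)) / n
        total += mix_mu2 - mix_mu ** 2
    return total


mus = [[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]]
variances = [[1.0, 0.5], [0.25, 0.25], [2.0, 1.0]]
ratio_gap = abs(j_mm(mus, variances) - j_uk(mus, variances) / len(mus))
```

On this toy input the two sides agree up to floating-point error, which is consistent with the constant-factor relation stated in the proposition.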



Trying to overcome the limitations of UK-means and
MMVar. In the attempt to derive an alternative objective
function that overcomes the limitations shared by UK-means
and MMVar, a straightforward solution could be to combine
the definition of MMVar centroid with the UK-means cluster
compactness criterion, obtaining the following objective
function $J(C)$ (by contrast, note that taking the notion of
centroid from UK-means while employing the MMVar cluster
compactness criterion would be meaningless, since the
variance of a deterministic centroid is zero):

$$ J(C) = \sum_{o \in C} ED(o, C_{MM}) \tag{12} $$

where $ED$ denotes the (squared) expected distance between
any two uncertain objects [7], and is exploited here to
compute the distance between any object $o = (\mathcal{R}, f) \in C$ and
the centroid (mixture model) $C_{MM} = (\mathcal{R}_{MM}, f_{MM})$ of $C$:

$$ ED(o, C_{MM}) = \int_{\vec{x} \in \mathcal{R}} \int_{\vec{y} \in \mathcal{R}_{MM}} \|\vec{x} - \vec{y}\|^2 f(\vec{x})\, f_{MM}(\vec{y})\, d\vec{x}\, d\vec{y} \tag{13} $$



Unfortunately, adopting such an objective function $J$ is not
appropriate either, as $J$ is in turn proportional to the functions
$J_{UK}$ and $J_{MM}$, as shown in the following.

Proposition 3. Let $C$ be a cluster of m-dimensional
uncertain objects, where $o = (\mathcal{R}, f)$, $\forall o \in C$. In reference to
the functions $J_{UK}$, $J_{MM}$, and $J$ defined in (9), (11), and (12),
respectively, it holds that $J(C) = 2\,|C|\,J_{MM}(C) = 2\,J_{UK}(C)$.
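One possible way to see why Proposition 3 holds is sketched below in LaTeX. This is an informal outline and not the authors' proof (which is deferred to the Appendix); the symbols $g$ and $\vec{\nu}$ are introduced here only for the generic first step.

```latex
% Step 1 (generic): for a pdf f with mean \vec{\mu} and a pdf g with mean \vec{\nu},
% expanding \|\vec{x}-\vec{y}\|^2 around the two means makes the cross terms vanish:
\iint \|\vec{x}-\vec{y}\|^2 f(\vec{x})\, g(\vec{y})\, d\vec{x}\, d\vec{y}
  = \sigma^2(f) + \sigma^2(g) + \|\vec{\mu}-\vec{\nu}\|^2

% Step 2: apply Step 1 to (13); the mean of the equal-weight mixture is C_{UK}:
ED(o, C_{MM}) = \sigma^2(o) + \sigma^2(C_{MM}) + \|\vec{\mu}(o) - C_{UK}\|^2

% Step 3: sum over o \in C. By (6) and (8),
% J_{UK}(C) = \sum_{o \in C} \left[ \sigma^2(o) + \|\vec{\mu}(o) - C_{UK}\|^2 \right], hence
J(C) = J_{UK}(C) + |C|\, \sigma^2(C_{MM})
     = J_{UK}(C) + |C|\, J_{MM}(C)
     = 2\, J_{UK}(C)
% where the last equality uses Proposition 2.
```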

4. UNCERTAIN CENTROID BASED 
PARTITIONAL CLUSTERING 

4.1 U-centroid 

We have demonstrated that the weaknesses of both UK- 
means and MMVar objective functions cannot be overcome 
even by mixing their basic elements, i.e., the notions of cluster
centroid and cluster compactness criterion. 

To define a sound objective function for partitional clus- 
tering of uncertain objects, we propose here a solution based 
on a novel notion of cluster centroid. Our key idea is to take 
into account the random variable whose realizations describe 
all possible deterministic representations of the centroid be- 
ing defined, in such a way that each of these representations 
corresponds to the point that minimizes a certain distance 
function (e.g., the (squared) Euclidean distance) between 
itself and a set of possible representations of the uncertain 
objects in the given cluster. More precisely, to define the
proposed centroid $\overline{C}$ of a given cluster $C$ of uncertain
objects, we consider a real-valued random variable $X_{\overline{C}}$, whose
observation space is comprised of the events "$\vec{x}$ is the actual
deterministic representation of $\overline{C}$", $\forall \vec{x} \in \mathbb{R}^m$. The rationale
underlying $X_{\overline{C}}$ is as follows: since each object within
$C$ has a multiple representation due to its own uncertainty,
the centroid of $C$ should have in turn a multiple representation
that takes into account the various representations
of the objects within $C$; in particular, it should be required
that each specific representation of the centroid of $C$ derives
from the minimization of a certain distance measure (e.g.,
Euclidean distance) from a set of points, each corresponding
to a possible realization of an uncertain object to be
summarized.

Figure 3 illustrates the above concept. Three 2-dimensional
uncertain objects forming a cluster are represented along
with the cluster centroid; for the sake of simplicity, only
the domain regions of the uncertain objects are depicted, as
the reasoning being explained holds regardless of a particular
pdf. The figure shows that the actual representation
of the centroid $\overline{C}$ should change according to the specific
points considered within $\mathcal{R}'$, $\mathcal{R}''$, and $\mathcal{R}'''$ as actual
representations of the uncertain objects $o'$, $o''$, and $o'''$, respectively.
For example, the set of points $\{\vec{x}', \vec{x}'', \vec{x}'''\}$ would lead to the
representation of $\overline{C}$ corresponding to the point $\vec{x}$.

In this way, our notion of uncertain centroid, named
U-centroid, is conceived to gain two main conceptual
advantages over existing notions of centroid for uncertain object
clusters: 1) it addresses the shortcomings that are typical
of a deterministic centroid notion (and hence overcomes a
major drawback of the UK-means centroid definition), and
2) it has a clear stochastic meaning. Particularly, the latter
is related to an improvement of the notion of centroid in
MMVar, as U-centroid is defined in terms of a random
variable that describes the outcomes of a clear set of probability
events whereas MMVar centroid is a mixture model defined
by averaging the pdfs of the objects in a cluster.

Figure 3: Example of uncertain cluster centroid
computation based on multiple deterministic
representations of uncertain objects

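In the following, we formally define the proposed notion
of U-centroid, as an uncertain object $\overline{C} = (\overline{\mathcal{R}}, \overline{f})$ defined
in terms of the random variable $X_{\overline{C}}$. Note that $\overline{f}$ is a pdf
that expresses the probability of each realization of $X_{\overline{C}}$, i.e.,
$\overline{f}(\vec{x}) = \Pr(\overline{C} = \vec{x}) = \Pr(X_{\overline{C}} = \vec{x})$, $\forall \vec{x} \in \mathbb{R}^m$.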

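Theorem 1. Given a cluster $C = \{o_1, \ldots, o_{|C|}\}$ of
m-dimensional uncertain objects, where $o_i = (\mathcal{R}_i, f_i)$ and
$\mathcal{R}_i = [\ell_i^{(1)}, u_i^{(1)}] \times \cdots \times [\ell_i^{(m)}, u_i^{(m)}]$, $\forall i \in [1..|C|]$,
let $\overline{C} = (\overline{\mathcal{R}}, \overline{f})$ be the
U-centroid of $C$ defined by employing the squared Euclidean
norm as distance to be minimized. It holds that:

$$ \overline{f}(\vec{x}) = \int_{\vec{x}_1 \in \mathcal{R}_1} \cdots \int_{\vec{x}_{|C|} \in \mathcal{R}_{|C|}} \mathbb{I}\!\left[\vec{x} = \frac{1}{|C|} \sum_{i=1}^{|C|} \vec{x}_i\right] \prod_{i=1}^{|C|} f_i(\vec{x}_i)\; d\vec{x}_1 \cdots d\vec{x}_{|C|} $$

$$ \overline{\mathcal{R}} = \left[\frac{1}{|C|} \sum_{i=1}^{|C|} \ell_i^{(1)},\ \frac{1}{|C|} \sum_{i=1}^{|C|} u_i^{(1)}\right] \times \cdots \times \left[\frac{1}{|C|} \sum_{i=1}^{|C|} \ell_i^{(m)},\ \frac{1}{|C|} \sum_{i=1}^{|C|} u_i^{(m)}\right] $$

where $\mathbb{I}[A]$ is the indicator function, which is 1 when the
event $A$ occurs, 0 otherwise.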

It can be straightforwardly derived that $\overline{f}$ and $\overline{\mathcal{R}}$ satisfy
the conditions reported in (1), which make any U-centroid
an uncertain object according to Definition 1.
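The construction behind the U-centroid can be mimicked by sampling: draw one deterministic representation per object and take the point minimizing the sum of squared Euclidean distances to them, i.e., their mean. The Python sketch below does this for objects assumed, for illustration only, to be uniform over boxes; it also computes the box obtained by averaging the per-dimension bounds, inside which every such mean must fall. Function names and data are hypothetical.

```python
import random


def sample_u_centroid(boxes, rng):
    """One realization of the U-centroid: mean of one sampled point per object."""
    m = len(boxes[0][0])
    # One deterministic representation per (assumed uniform) object
    points = [[rng.uniform(l, u) for l, u in zip(lower, upper)]
              for lower, upper in boxes]
    # The mean minimizes the sum of squared Euclidean distances to the points
    return [sum(p[j] for p in points) / len(points) for j in range(m)]


def averaged_bounds(boxes):
    """Box whose bounds are the averages of the objects' lower/upper bounds."""
    n, m = len(boxes), len(boxes[0][0])
    lower = [sum(b[0][j] for b in boxes) / n for j in range(m)]
    upper = [sum(b[1][j] for b in boxes) / n for j in range(m)]
    return lower, upper


boxes = [([0.0, 1.0], [2.0, 4.0]),
         ([1.0, 0.0], [2.0, 1.0]),
         ([-3.0, 0.0], [3.0, 2.0])]
rng = random.Random(1)
realizations = [sample_u_centroid(boxes, rng) for _ in range(200)]
region_lower, region_upper = averaged_bounds(boxes)
```

A histogram of such realizations gives an empirical picture of the U-centroid pdf, which is useful precisely because the pdf itself is generally not available analytically.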

4.2 U-centroid-based Cluster Compactness 

In general, the pdf $\overline{f}$ of any U-centroid reported in
Theorem 1 cannot be computed analytically. Hence, we define
a cluster compactness criterion based on the notion of
U-centroid that does not require explicitly computing $\overline{f}$.

4.2.1 Minimizing the U-centroid Variance 

As a U-centroid is an uncertain object, an intuitive
definition of a U-centroid-based cluster compactness would be to
exploit the same approach as MMVar, i.e., the minimization
of the variance of U-centroids. As in MMVar, a major
advantage of this choice would lie in the capability of exploiting
an analytical formula for the variance computation. Given a
cluster $C$ of uncertain objects, this formula would allow for
efficiently computing the variance of the U-centroid of any
new cluster $C \cup \{o\}$ (obtained by adding an uncertain object
to $C$) or $C \setminus \{o\}$ (obtained by removing an uncertain object
from $C$) linearly with the number m of dimensions.

The following theorem shows that the variance of the
U-centroid of a cluster $C$ depends only on the variances
of the individual uncertain objects within $C$ (it equals
their sum scaled by $|C|^{-2}$). Though the
resulting expression is easy and fast to compute, we prove
that minimizing the variance of the U-centroid is nevertheless
not appropriate to build compact clusters of uncertain
objects. In fact, measuring cluster compactness according only
to the variances of the individual uncertain objects may lead
to wrong results, as in general it does not take into account
the distances among data objects (cf. Figure 2).

Theorem 2. Given a cluster $C = \{o_1, \ldots, o_{|C|}\}$ of
m-dimensional uncertain objects, where $o_i = (\mathcal{R}_i, f_i)$, $\forall i \in [1..|C|]$,
let $\overline{C} = (\overline{\mathcal{R}}, \overline{f})$ be the U-centroid of $C$ defined
according to Theorem 1. It holds that $\sigma^2(\overline{C}) = |C|^{-2} \sum_{i=1}^{|C|} \sigma^2(o_i)$.
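The relation between the U-centroid variance and the object variances can be explored empirically. The Python sketch below estimates the global variance of sampled U-centroid realizations and compares it with the sum of the object variances scaled by $|C|^{-2}$, which is what independence of the objects yields for the variance of their mean. Uniform pdfs over boxes and all names are assumptions made only for this illustration.

```python
import random


def u_centroid_variance_mc(boxes, n_samples, rng):
    """Sample-based global variance of U-centroid realizations (mean of the objects)."""
    n, m = len(boxes), len(boxes[0][0])
    sums, sq_sums = [0.0] * m, [0.0] * m
    for _ in range(n_samples):
        for j in range(m):
            # j-th coordinate of one realization: mean of one draw per object
            cj = sum(rng.uniform(lo[j], hi[j]) for lo, hi in boxes) / n
            sums[j] += cj
            sq_sums[j] += cj * cj
    return sum(sq / n_samples - (s / n_samples) ** 2
               for s, sq in zip(sums, sq_sums))


def scaled_sum_of_variances(boxes):
    """|C|^-2 times the sum of the global variances of the (uniform) objects."""
    n = len(boxes)
    total = sum((u - l) ** 2 / 12.0
                for lower, upper in boxes for l, u in zip(lower, upper))
    return total / (n * n)


boxes = [([0.0, 1.0], [2.0, 4.0]),
         ([1.0, 0.0], [2.0, 1.0]),
         ([-3.0, 0.0], [3.0, 2.0])]
estimate = u_centroid_variance_mc(boxes, 20000, random.Random(2))
expected = scaled_sum_of_variances(boxes)
```

Note that moving the boxes apart or closer together changes neither quantity, which illustrates why this variance alone cannot capture how far the objects are from each other.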

4.2.2 Minimizing the Expected Distance between Uncertain
Objects and U-centroids

After proving that the variance of the U-centroid is not effective
for measuring cluster compactness, we focus here on
a different U-centroid-based criterion, which consists in the
minimization of the sum of expected distances between the
objects in a cluster and the U-centroid of that cluster:

$$ J(C) = \sum_{o \in C} ED(o, \overline{C}) \tag{14} $$

Again, $ED$ denotes the (squared) expected distance between
any two uncertain objects (cf. (13)).

In the following theorem, we show that the proposed
objective function $J$ overcomes the limitations of all cluster
compactness criteria previously discussed in the paper.
Specifically, $J$ directly exploits the sum of the variances of the
individual objects in a cluster, and hence it explicitly solves
the major UK-means/MMVar issue (cf. Section 3 and Figure 1);
in addition, $J$ takes into account the sum of expected
values of objects as well, which allows for overcoming the
issue arising from the criterion based on the minimization of
the variance of $\overline{C}$ only (cf. Section 4.2.1 and Figure 2).

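Theorem 3. Let $C = \{o_1, \ldots, o_{|C|}\}$ be a cluster of
m-dimensional uncertain objects, where $o_i = (\mathcal{R}_i, f_i)$, $\forall i \in [1..|C|]$,
and $\overline{C} = (\overline{\mathcal{R}}, \overline{f})$ be the U-centroid of $C$ defined
according to Theorem 1. In reference to the function $J$ defined
in (14), it holds that:

$$ J(C) = \sum_{j=1}^{m} \left( \frac{\Psi_C^{(j)}}{|C|} + \Phi_C^{(j)} - \frac{\Upsilon_C^{(j)}}{|C|} \right) = \frac{1}{|C|} \sum_{i=1}^{|C|} \sigma^2(o_i) + J_{UK}(C) $$

where $J_{UK}$ is the UK-means objective function (cf. (9)) and

$$ \Psi_C^{(j)} = \sum_{i=1}^{|C|} (\sigma^2)_j(o_i) \qquad \Phi_C^{(j)} = \sum_{i=1}^{|C|} (\mu_2)_j(o_i) \qquad \Upsilon_C^{(j)} = \left( \sum_{i=1}^{|C|} \mu_j(o_i) \right)^{2} $$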
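The closed form of Theorem 3 only needs per-dimension sums of variances, second-order moments, and expected values. A minimal Python sketch of that computation follows; the function names and the toy data are hypothetical, and each object is summarized by its expected-value and variance vectors (the second-order moment is recovered as variance plus squared mean, cf. (5)).

```python
def cluster_terms(mus, variances):
    """Per-dimension Psi, Phi, Upsilon terms of a cluster (cf. Theorem 3)."""
    m = len(mus[0])
    psi = [sum(var[j] for var in variances) for j in range(m)]
    # (mu_2)_j = (sigma^2)_j + mu_j^2, cf. Eq. (5)
    phi = [sum(var[j] + mu[j] ** 2 for mu, var in zip(mus, variances))
           for j in range(m)]
    upsilon = [sum(mu[j] for mu in mus) ** 2 for j in range(m)]
    return psi, phi, upsilon


def u_centroid_objective(mus, variances):
    """J(C) as the sum over dimensions of Psi/|C| + Phi - Upsilon/|C|."""
    size = len(mus)
    psi, phi, upsilon = cluster_terms(mus, variances)
    return sum(p / size + f - u / size for p, f, u in zip(psi, phi, upsilon))


mus = [[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]]
variances = [[1.0, 0.5], [0.25, 0.25], [2.0, 1.0]]
j_value = u_centroid_objective(mus, variances)
```

For this toy cluster, the sum of object variances is 5 and the sum of squared distances of the means from their average is 8, so the value can be cross-checked by hand as 5/3 + (5 + 8).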



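Corollary 1. Let $C$ be a cluster of uncertain objects,
and $C^+ = C \cup \{o^+\}$, $C^- = C \setminus \{o^-\}$ be two clusters defined
by adding an object $o^+ \notin C$ to $C$ and removing an object
$o^- \in C$ from $C$, respectively. In reference to the expression
$J(C) = \sum_{j=1}^{m} \left( |C|^{-1} \Psi_C^{(j)} + \Phi_C^{(j)} - |C|^{-1} \Upsilon_C^{(j)} \right)$, reported in
Theorem 3, it holds that:

$$ J(C^+) = \sum_{j=1}^{m} \left( \frac{\Psi_{C^+}^{(j)}}{|C|+1} + \Phi_{C^+}^{(j)} - \frac{\Upsilon_{C^+}^{(j)}}{|C|+1} \right) \tag{15} $$

$$ J(C^-) = \sum_{j=1}^{m} \left( \frac{\Psi_{C^-}^{(j)}}{|C|-1} + \Phi_{C^-}^{(j)} - \frac{\Upsilon_{C^-}^{(j)}}{|C|-1} \right) \tag{16} $$

where
$\Psi_{C^+}^{(j)} = \Psi_C^{(j)} + (\sigma^2)_j(o^+)$, $\Phi_{C^+}^{(j)} = \Phi_C^{(j)} + (\mu_2)_j(o^+)$,
$\Upsilon_{C^+}^{(j)} = \left( \sqrt{\Upsilon_C^{(j)}} + \mu_j(o^+) \right)^2$,
$\Psi_{C^-}^{(j)} = \Psi_C^{(j)} - (\sigma^2)_j(o^-)$, $\Phi_{C^-}^{(j)} = \Phi_C^{(j)} - (\mu_2)_j(o^-)$,
and $\Upsilon_{C^-}^{(j)} = \left( \sqrt{\Upsilon_C^{(j)}} - \mu_j(o^-) \right)^2$.

Algorithm 1 UCPC

Input: A set $\mathcal{D}$ of m-dimensional uncertain objects; the number k
       of output clusters
Output: A partition (clustering) $\mathcal{C}$ of $\mathcal{D}$

 1: compute $\vec{\mu}(o)$, $\vec{\mu}_2(o)$, $\vec{\sigma}^2(o)$, $\forall o \in \mathcal{D}$
 2: $\mathcal{C} \leftarrow$ initialPartition($\mathcal{D}$, k)
 3: compute $\Psi_C^{(j)}$, $\Phi_C^{(j)}$, $\Upsilon_C^{(j)}$, $J(C)$, $\forall j \in [1..m]$, $\forall C \in \mathcal{C}$,
    according to Theorem 3
 4: repeat
 5:   $V \leftarrow \sum_{C \in \mathcal{C}} J(C)$
 6:   for all $o \in \mathcal{D}$ do
 7:     let $C^o \in \mathcal{C}$ be the cluster s.t. $o \in C^o$
 8:     $C^* \leftarrow \arg\min_{C \in \mathcal{C}} V - [J(C^o) + J(C)] + [J(C^o \setminus \{o\}) + J(C \cup \{o\})]$
 9:     if $C^* \neq C^o$ then
10:       let $C^+ = C^* \cup \{o\}$, $C^- = C^o \setminus \{o\}$
11:       $\mathcal{C} \leftarrow \mathcal{C} \setminus \{C^*, C^o\} \cup \{C^+, C^-\}$
12:       replace $\Psi_{C^*}^{(j)}$, $\Phi_{C^*}^{(j)}$, $\Upsilon_{C^*}^{(j)}$, $J(C^*)$ with $\Psi_{C^+}^{(j)}$, $\Phi_{C^+}^{(j)}$, $\Upsilon_{C^+}^{(j)}$,
          $J(C^+)$, $\forall j \in [1..m]$, according to (15)
13:       replace $\Psi_{C^o}^{(j)}$, $\Phi_{C^o}^{(j)}$, $\Upsilon_{C^o}^{(j)}$, $J(C^o)$ with $\Psi_{C^-}^{(j)}$, $\Phi_{C^-}^{(j)}$, $\Upsilon_{C^-}^{(j)}$,
          $J(C^-)$, $\forall j \in [1..m]$, according to (16)
14:     end if
15:   end for
16: until no object in $\mathcal{D}$ is relocated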



It should also be emphasized that Theorem 3 provides a
closed-form expression for $J$ that does not require explicitly
computing the pdf $\overline{f}$ of $\overline{C}$. This expression, which
is efficiently computable in $O(|C|\,m)$, is given by a linear
combination of the terms $\Psi_C^{(j)}$, $\Phi_C^{(j)}$ and $\Upsilon_C^{(j)}$, $\forall j \in [1..m]$.
This result also lays the basis for an efficient computation
of the objective functions $J(C^+)$ and $J(C^-)$ of any two clusters
$C^+$ and $C^-$ defined by adding/removing an uncertain
object to/from the original cluster $C$. According to Corollary 1,
in fact, this can be done in $O(m)$, as the terms $\Psi_{C^+}^{(j)}$,
$\Phi_{C^+}^{(j)}$ and $\Upsilon_{C^+}^{(j)}$ (resp. $\Psi_{C^-}^{(j)}$, $\Phi_{C^-}^{(j)}$ and $\Upsilon_{C^-}^{(j)}$) that compose
function $J(C^+)$ (resp. $J(C^-)$) can be computed in constant
time given the earlier $\Psi_C^{(j)}$, $\Phi_C^{(j)}$ and $\Upsilon_C^{(j)}$ terms and the
expected value, second order moment and variance of the
object to be added/removed to/from $C$ (cf. (15)-(16)).
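The constant-time updates described above can be sketched as a small running-statistics structure. The Python sketch below is illustrative only: the class and method names are hypothetical, and, as a deliberate implementation choice of this sketch, it stores the signed sum of expected values per dimension (squaring it on demand) instead of the squared term itself, which avoids taking a square root.

```python
class ClusterTerms:
    """Running per-dimension sums for one cluster (hypothetical helper)."""

    def __init__(self, m):
        self.size = 0
        self.psi = [0.0] * m      # sums of variances
        self.phi = [0.0] * m      # sums of second-order moments
        self.mu_sum = [0.0] * m   # signed sums of expected values

    def update(self, mu, mu2, var, sign):
        """sign=+1 adds an object (cf. (15)); sign=-1 removes one (cf. (16))."""
        self.size += sign
        for j in range(len(self.psi)):
            self.psi[j] += sign * var[j]
            self.phi[j] += sign * mu2[j]
            self.mu_sum[j] += sign * mu[j]

    def objective(self):
        """J of the current (non-empty) cluster, computed in O(m)."""
        n = self.size
        return sum(p / n + f - s * s / n
                   for p, f, s in zip(self.psi, self.phi, self.mu_sum))


# Toy objects given as (expected value, second-order moment, variance) vectors
objects = [([0.0, 0.0], [1.0, 0.5], [1.0, 0.5]),
           ([2.0, 0.0], [4.25, 0.25], [0.25, 0.25]),
           ([1.0, 3.0], [3.0, 10.0], [2.0, 1.0])]
terms = ClusterTerms(2)
for mu, mu2, var in objects:
    terms.update(mu, mu2, var, +1)
j_three = terms.objective()
terms.update(*objects[2], -1)   # remove the third object again
j_two = terms.objective()
```

Each update touches every dimension once, so adding or removing an object costs O(m) regardless of the cluster size.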

4.3 The UCPC Algorithm 

The problem of partitional clustering of uncertain objects
can be formulated as $\mathcal{C}^* = \arg\min_{\mathcal{C}} \sum_{C \in \mathcal{C}} J(C)$. As
it refers to an NP-hard problem, we define a local search-based
heuristic that exploits the results reported in
Theorem 3 and Corollary 1 to compute effective and efficient
approximations.



Table 1: Datasets used in the experiments

(a) Benchmark datasets

dataset      obj.        attr.   classes
Iris         150         4       3
Wine         178         13      3
Glass        214         10      6
Ecoli        327         7       5
Yeast        1,484       8       10
Image        2,310       19      7
Abalone      4,124       7       17
Letter       7,648       16      10
KDDCup99     4,000,000   42      23

(b) Real datasets

dataset          obj.     attr.
Neuroblastoma    22,282   14
Leukaemia        22,690   21


Algorithm 1 shows our proposed heuristic algorithm, called
U-Centroid-based Partitional Clustering (UCPC). After a
preliminary phase in which the expected value, second order
moment and variance of each object within the input dataset
$\mathcal{D}$ are computed (Line 1), UCPC starts by taking an initial
partition of $\mathcal{D}$ (Line 2) (e.g., a random partition). Then it
follows an iterative procedure such that, at each step, a new
clustering is sought that improves on the one obtained at
the previous iteration (Lines 4-16). To profitably exploit
Theorem 3 and Corollary 1, the new clustering is formed by
looking at all possible relocations of an object from its current
cluster to a different cluster: for each object $o \in \mathcal{D}$, the
relocation that gives the maximum decrement of the objective
function w.r.t. the previous clustering will be considered to
form the new clustering.

The proposed UCPC converges to a local optimum of the 
objective function therein involved, and works linearly with 
both the size of the input dataset and the dimensionality of 
the input uncertain objects, as formally shown next. 

Proposition 4. The UCPC algorithm outlined in Alg. 1
converges to a local minimum of function $\sum_{C \in \mathcal{C}} J(C)$ in a
finite number of steps.

Proposition 5. Given a set $\mathcal{D}$ of n m-dimensional
uncertain objects, the number k of output clusters, and
denoting by I the number of iterations to convergence, the
computational complexity of the UCPC algorithm (Alg. 1)
is $O(I\,k\,n\,m)$.

According to Proposition 5, the proposed UCPC has a 
complexity equal to that of the fastest existing partitional 
methods for clustering uncertain objects, i.e., UK-means 
and MMVar (cf. Section 2); this result proves a major claim 
of this work, which concerns the efficiency in solving the 
problem of partitional clustering of uncertain objects.

5. EXPERIMENTS 

Our experimental evaluation was conducted to assess ef- 
fectiveness, efficiency, and scalability of the proposed UCPC 
algorithm. For the effectiveness and efficiency evaluations, 
we used eight benchmark datasets (where the uncertainty 
was synthetically generated) available from [2] and two real 
datasets (which originally contained uncertainty) that
describe gene expressions in biological tissues (microarray
analysis) [3]; Table 1 reports the main characteristics of the
datasets. Moreover, specifically for the scalability study, we
used a very large dataset (4 million objects, last row of
Table 1-(a)), which was employed for the KDD Cup '99 contest
and is now available from the UCI repository [2].

We comparatively evaluated UCPC with the other partitional
algorithms, i.e., UK-means (UKM), UK-medoids
(UKmed), and MMVar (MMV), with the density-based
algorithms, i.e., FDBSCAN (FDB) and FOPTICS (FOPT),
and with the agglomerative hierarchical algorithm UAHC.
We also included the UK-means variants, namely the basic
UK-means (bUKM) and the pruning-based methods MinMax-BB
and VDBiP (cf. Section 2.2); however, they were considered
in the efficiency evaluation only, since they share with
UK-means the underlying clustering scheme.

To prevent clustering results from being biased by random
chance (due to non-deterministic operations, such as
computing initial centroids/medoids/partitions), all accuracy and
efficiency measurements for each of the algorithms were
averaged over 50 runs.

5.1 Assessment Methodology 

The quality of clustering solutions was evaluated by means 
of both external and internal criteria. 

External criteria exploit the availability of reference
classifications to evaluate how well a clustering fits a predefined
scheme of known classes. A reference classification is hence
intended as a predetermined organization of the data objects
into a set of classes; clearly, reference classifications were
used only for evaluation purposes, and not during the
clustering task. We employed the well-known F-measure ($F$),
which ranges within $[0, 1]$ such that higher values correspond
to better quality results. Denoting with $\tilde{\mathcal{C}} = \{\tilde{C}_1, \ldots, \tilde{C}_{\tilde{k}}\}$ a
reference classification and with $\mathcal{C} = \{C_1, \ldots, C_k\}$ a clustering
solution, F-measure is defined as:

\[ F(\mathcal{C}, \tilde{\mathcal{C}}) = \frac{1}{|\mathcal{D}|} \sum_{u=1}^{\tilde{k}} |\tilde{C}_u| \max_{v \in [1..k]} F_{uv} \]

where $F_{uv} = (2\,P_{uv}\,R_{uv})/(P_{uv} + R_{uv})$ such that $P_{uv} = |C_v \cap \tilde{C}_u| / |C_v|$
and $R_{uv} = |C_v \cap \tilde{C}_u| / |\tilde{C}_u|$, for each $v \in [1..k]$
and $u \in [1..\tilde{k}]$.
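As a concrete reading of the F-measure above, the following sketch computes it from two label sequences of equal length, one for the clustering and one for the reference classification. The function name and the label-list representation are assumptions made for illustration; the weighting by class size and the maximum over clusters follow the definition given in the text.

```python
from typing import Hashable, Sequence


def f_measure(clustering: Sequence[Hashable],
              reference: Sequence[Hashable]) -> float:
    """Class-size-weighted F-measure of a clustering w.r.t. a reference."""
    n = len(reference)
    total = 0.0
    for u in set(reference):
        class_members = {i for i in range(n) if reference[i] == u}
        best = 0.0
        for v in set(clustering):
            cluster_members = {i for i in range(n) if clustering[i] == v}
            overlap = len(class_members & cluster_members)
            if overlap == 0:
                continue  # F_uv is taken as 0 when the two sets are disjoint
            precision = overlap / len(cluster_members)
            recall = overlap / len(class_members)
            best = max(best, 2 * precision * recall / (precision + recall))
        total += len(class_members) * best
    return total / n
```

A clustering that reproduces the reference classes up to renaming scores 1, whereas lumping two equally sized classes into a single cluster scores 2/3.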

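We also used internal criteria based on intra-cluster ($\mathit{intra}(\mathcal{C})$)
and inter-cluster ($\mathit{inter}(\mathcal{C})$) distances (for a given
clustering solution $\mathcal{C}$), which express cluster cohesiveness and
cluster separation, respectively:

\[ \mathit{intra}(\mathcal{C}) = \frac{1}{|\mathcal{C}|} \sum_{C \in \mathcal{C}} \frac{1}{|C|\,(|C|-1)} \sum_{\substack{o, o' \in C \\ o \neq o'}} ED(o, o') \]

\[ \mathit{inter}(\mathcal{C}) = \frac{1}{|\mathcal{C}|\,(|\mathcal{C}|-1)} \sum_{\substack{C, C' \in \mathcal{C} \\ C \neq C'}} \frac{1}{|C| \times |C'|} \sum_{o \in C} \sum_{o' \in C'} ED(o, o') \]

Such distance values are finally combined into a single value
$Q(\mathcal{C}) = \mathit{inter}(\mathcal{C}) - \mathit{intra}(\mathcal{C})$, such that the lower $\mathit{intra}(\mathcal{C})$ or
the higher $\mathit{inter}(\mathcal{C})$, the better the clustering quality $Q(\mathcal{C})$.
Since intra and inter values were normalized within $[0, 1]$,
$Q$ ranges within $[-1, 1]$.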
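The internal criterion can be sketched as follows. The function name, the representation of a clustering as a list of member lists, and the `dist` callback are assumptions; `dist` is expected to be symmetric and already normalized within [0, 1], and how that normalization was carried out in the experiments is not specified here. Averaging over unordered pairs is equivalent to the ordered-pair sums above when the distance is symmetric.

```python
from itertools import combinations
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def quality(clusters: List[Sequence[T]],
            dist: Callable[[T, T], float]) -> float:
    """Q = inter - intra for a clustering given as a list of clusters."""
    # Intra-cluster term: mean pairwise distance inside each cluster,
    # averaged over clusters (singletons contribute 0).
    intra_terms = []
    for c in clusters:
        pairs = list(combinations(c, 2))
        intra_terms.append(
            sum(dist(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0
        )
    intra = sum(intra_terms) / len(clusters)

    # Inter-cluster term: mean cross-cluster distance, averaged over
    # all pairs of distinct clusters.
    inter_terms = [
        sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))
        for c1, c2 in combinations(clusters, 2)
    ]
    inter = sum(inter_terms) / len(inter_terms) if inter_terms else 0.0
    return inter - intra
```

For two tight, well-separated groups on the unit interval, for example, the intra term is small and the inter term is close to the gap between the groups, so Q approaches its upper range.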

Uncertainty generation. We synthetically generated
uncertainty in benchmark datasets, as they originally contain
deterministic values; conversely, this was not necessary for
real microarray datasets since they inherently have
probe-level uncertainty, which can easily be modeled in the form
of Normal pdfs according to the multi-mgMOS method [15].¹
According to an approach already employed by previous
works [4], we developed the following uncertainty generation
strategy.

Given a (deterministic) benchmark dataset $\mathcal{D}$, we first
generated a pdf $f_w$ for each (deterministic) point $w$ within
$\mathcal{D}$. In particular, we considered the Uniform, Normal and
Exponential pdfs, as they are commonly encountered in real
uncertain data scenarios [1]. Every $f_w$ was defined in such
a way that its expected value corresponded exactly to $w$
(i.e., $\mu(f_w) = w$), whereas all other parameters (such as the
width of the intervals of the Uniform pdfs or the standard
deviation of the Normal pdfs) were randomly chosen. We
exploited the pdfs $f_w$ to simulate what actually happens in
typical real contexts for uncertain data. Thus, we focused
on two evaluation cases:

1. the clustering task is performed by considering only 
the observed (i.e., non-uncertain) representations of 
the various data objects; 

2. the clustering task is performed by involving an uncer- 
tainty model. 

The ultimate goal was to assess whether the results obtained
in Case 2 are better than those obtained in Case 1.

In Case 1, we generated a perturbed dataset $\mathcal{D}'$ from $\mathcal{D}$ by
adding to each point $w \in \mathcal{D}$ random noise sampled from its
assigned pdf $f_w$ according to the classic Monte Carlo and
Markov Chain Monte Carlo methods.² As a result, $\mathcal{D}'$ still
contains deterministic data. In our evaluation, each of the
selected clustering methods was carried out on $\mathcal{D}'$ so that
it produced an output clustering solution denoted by $\mathcal{C}'$. A
score $F(\mathcal{C}', \tilde{\mathcal{C}})$ was hence obtained by comparing the output
clustering $\mathcal{C}'$ to the reference classification of $\mathcal{D}$ (denoted by
$\tilde{\mathcal{C}}$) by means of the F-measure cluster validity criterion.

In Case 2, when uncertainty is taken into account, we
further created an uncertain dataset $\mathcal{D}''$ from $\mathcal{D}$, which is the
one designed to contain uncertain objects. In particular, for
each $w \in \mathcal{D}$, we derived an uncertain object $o = (\mathcal{R}, f)$ so
that $f = f_w$, while $\mathcal{R}$ was defined as the region containing
most of the area (e.g., 95%) of $f_w$. Again, we ran each of
the selected clustering methods on $\mathcal{D}''$ as well, in order to
obtain a clustering solution $\mathcal{C}''$ and a score $F(\mathcal{C}'', \tilde{\mathcal{C}})$ against
the reference classification $\tilde{\mathcal{C}}$.

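Finally, we compared the scores obtained in Case 1 and
Case 2 by computing $\Theta(\mathcal{C}', \mathcal{C}'', \tilde{\mathcal{C}}) = F(\mathcal{C}'', \tilde{\mathcal{C}}) - F(\mathcal{C}', \tilde{\mathcal{C}})$,
where $\tilde{\mathcal{C}}$ is the reference classification; the higher $\Theta$, the
better the quality of $\mathcal{C}''$ w.r.t. $\mathcal{C}'$,
and, therefore, the better the performance of the clustering
method when the uncertainty is taken into account w.r.t.
the case where no uncertainty is employed. Note that $\Theta$
ranges within $[-1, 1]$.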
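The two evaluation cases can be illustrated for a single one-dimensional attribute and a Normal pdf. This is only a sketch under stated assumptions: the function names, the range from which the standard deviation is drawn, and the reading of the Case 1 perturbation as a draw from $f_w$ are hypothetical, and the experiments relied on the SSJ library rather than on Python.

```python
import random
from statistics import NormalDist
from typing import Tuple


def make_normal_pdf(w: float, rng: random.Random,
                    max_sigma: float = 1.0) -> NormalDist:
    """Normal pdf f_w whose expected value is exactly w; sigma is random."""
    sigma = rng.uniform(0.05, max_sigma)  # hypothetical range for sigma
    return NormalDist(mu=w, sigma=sigma)


def perturbed_value(pdf: NormalDist, rng: random.Random) -> float:
    """Case 1: a deterministic observation drawn from f_w."""
    return rng.gauss(pdf.mean, pdf.stdev)


def region(pdf: NormalDist, mass: float = 0.95) -> Tuple[float, float]:
    """Case 2: central interval holding `mass` of the area of f_w."""
    tail = (1.0 - mass) / 2.0
    return pdf.inv_cdf(tail), pdf.inv_cdf(1.0 - tail)


def theta(f_case2: float, f_case1: float) -> float:
    """Theta = F(C'', ref) - F(C', ref), within [-1, 1] for F in [0, 1]."""
    return f_case2 - f_case1
```

A positive `theta` then indicates that the method benefited from the uncertainty model on that dataset, which is how the external scores in the results are to be read.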

5.2 Results 

5.2.1 Effectiveness 

Accuracy on Benchmark Datasets. Table 2 shows accuracy
results on benchmark datasets for Uniform (U), Normal
(N), and Exponential (E) distributions, in terms of both
external ($\Theta$) and internal ($Q$) cluster validity criteria (cf.
Section 5.1). We also report, for each method, (i) the score for

¹ We used the Bioconductor package PUMA (Propagating Uncertainty in Microarray Analysis), available at http://www.bioinf.manchester.ac.uk/resources/puma/.
² We used the SSJ library (www.iro.umontreal.ca/~simardr/ssj/).

Table 2: Accuracy results on benchmark datasets: external (F-measure) and internal (Quality) criteria

                    F-measure (Θ ∈ [−1, 1])                            Quality (Q ∈ [−1, 1])
data       pdf          FDB   FOPT   UAHC  UKmed    UKM    MMV   UCPC      FDB   FOPT   UAHC  UKmed    UKM    MMV   UCPC
Iris       U          -.102   .005   .002   .023  -.062   .043   .061     .197   .093   .146   .148   .151   .147   .145
           N          -.063   .044   .017   .010  -.010   .056   .090     .238   .135   .215   .194   .263   .187   .197
           E          -.383   .023  -.174  -.045  -.249   .153   .161    -.004   .202  -.001   .081   .118   .692   .656
Wine       U          -.179   .174   .092   .175  -.179  -.011  -.023    -.002   .128  -.001   .012  -.001  -.001  -.006
           N          -.185   .030   .197  -.085  -.184   .160   .156     .022   .009   .050   .042  -.020   .124   .123
           E          -.208   .006  -.012  -.104  -.208  -.209  -.191                          .001          .011   .013
Glass      U          -.298   .012   .222   .084   .066   .107   .423    -.013   .001   .001   .060   .001   .226   .301
           N          -.040  -.136   .132  -.070  -.025  -.009   .128     .042   .006   .119   .041   .057   .008   .156
           E          -.334  -.182   .131   .009  -.231   .462   .552    -.002                 .006   .004   .140   .064
Ecoli      U          -.136   .023  -.014   .223   .199   .222   .702            .449   .008   .187   .101   .592   .648
           N           .061   .015   .269   .045   .131   .508   .533     .086   .284   .088   .029   .141   .151   .156
           E          -.383  -.239  -.129  -.034  -.160   .033   .003                          .003   .001   .187   .210
Yeast      U          -.085   .252   .255   .315   .220   .413   .642            .029   .001   .193   .041   .566   .580
           N           .079  -.001   .306  -.035   .159   .537   .620     .040   .222   .150   .005   .053   .253   .272
           E          -.311  -.195   .016  -.055  -.098   .336   .363                                        .184   .160
Image      U          -.283  -.113   .046   .241   .278   .071   .421                                        .725   .802
           N          -.251  -.081   .127  -.061   .122   .028   .278    -.001   .004   .130   .010   .065   .004   .253
           E          -.307  -.137  -.020   .087  -.024   .144   .202                                        .008   .119
Abalone    U          -.092   .291   .084   .379   .120   .539   .623    -.018   .010   .060   .071   .040   .226   .232
           N           .095  -.039   .109   .009   .034   .188   .111     .086   .054  -.030   .031   .103   .057   .053
           E          -.182   .315   .063   .025   .080   .546   .542                                        .226   .283
Letter     U          -.338  -.201   .026   .237   .008   .165   .582                   .001                 .279   .297
           N          -.340  -.203   .037  -.039  -.076   .127   .376    -.022   .207   .004   .357   .352   .331   .305
           E          -.431  -.294   .059   .033  -.202   .133   .153                                        .147   .094
avg score  U          -.189   .055   .089   .210   .081   .193   .429     .021   .089   .027   .084   .042   .345   .375
           N          -.081  -.046   .149  -.028   .019   .199   .287     .061   .115   .091   .089   .127   .139   .189
           E          -.317  -.088  -.008  -.011  -.137   .200   .223    -.001   .025          .011   .015   .199   .200
overall avg. score    -.196  -.026   .077   .057  -.012   .198   .313     .027   .076   .039   .061   .061   .228   .255
overall avg. gain     +.509  +.339  +.236  +.256  +.324  +.115      -    +.228  +.179  +.216  +.194  +.194  +.027      -

Table 3: Accuracy results (Quality) on real datasets

                  Quality (Q ∈ [−1, 1])
data    # clust.      FDB   FOPT   UAHC  UKmed    UKM    MMV   UCPC
Neuro.  2           -.004   .010   .917   .044   .057   .592   .598
        3           -.004   .017   .670   .047   .061   .600   .620
        5           -.004   .009   .847   .043   .060   .678   .718
        10          -.004   .008   .607   .048   .068   .098   .137
        15          -.004   .010   .578   .044   .060   .675   .717
        20          -.004   .009   .487   .047   .061   .582   .621
        25          -.004   .009   .465   .041   .065   .596   .631
        30          -.004   .008   .466   .043   .047   .532   .564
Leuk.   2           -.018   .068   .445   .221   .207   .212   .224
        3           -.018   .080   .258   .256   .392   .305   .352
        5           -.018   .061   .160   .245   .451   .481   .537
        10          -.018   .213   .150   .238   .455   .405   .451
        15          -.018   .192   .145   .246   .451   .501   .544
        20          -.018   .186   .126   .213   .479   .492   .528
        25          -.018   .353   .127   .215   .558   .588   .620
        30          -.018   .369   .122   .213   .448   .483   .512
Neuro. avg score    -.004   .010   .630   .045   .060   .544   .576
Leuk. avg score     -.018   .190   .192   .231   .430   .433   .471
overall avg score   -.011   .100   .411   .138   .245   .489   .523
overall avg gain    +.534  +.423  +.112  +.385  +.278  +.034      -



each type of pdf averaged over all datasets (for short, average
score), (ii) the score averaged over all datasets and pdfs
(for short, overall average score), and (iii) the overall average
gain of the proposed UCPC, computed as the difference
between the overall average score of UCPC and the overall
average score of any specific competing method.

Looking at the overall average scores and gains, the proposed
UCPC was more accurate than any other competing
method, in terms of both $\Theta$ and $Q$, with gains up to
0.509 ($\Theta$) and 0.228 ($Q$). This finding was confirmed by the
results obtained in terms of single dataset-by-pdf configuration.
Indeed, according to $\Theta$, UCPC achieved the best
results on 17 out of 24 dataset-by-pdf configurations, while,
for additional 5 configurations (i.e., all remaining ones except
Wine-Uniform and Wine-Exponential), its gap from the
best competing method was negligible (smaller than 0.080).
Similarly, considering $Q$, UCPC was the best method on
the majority (13) of the dataset-by-pdf configurations and
achieved results comparable to the best ones in a further 9
configurations: only on two configurations (Wine-Uniform
and Ecoli-Normal) was its gap from the best method greater
than 0.080.

Finally, we remark that the proposed UCPC generally
outperformed its most direct competitors UK-means and
MMVar, thus confirming a major claim of this work. Compared
to UK-means, UCPC achieved better $\Theta$ results on all
24 dataset-by-pdf configurations (gain up to 0.783), whereas
in terms of $Q$, UCPC outperformed UK-means on 19 out of
24 configurations (gain up to 0.802), while achieving negligible
gaps (smaller than 0.070) in the remaining 5 configurations.
UCPC outperformed MMVar as well, though MMVar
results were in general better than those achieved by UK-means.
Particularly, in terms of $\Theta$, UCPC was better than
MMVar on 19 out of 24 dataset-by-pdf configurations (gain
up to 0.480), while being comparable on the remaining 5
configurations (gaps smaller than 0.080).

Accuracy on Real Datasets. Table 3 shows accuracy re- 
sults obtained on Neuroblastoma and Leukaemia, and also 
summarizes (i) the scores on each dataset by averaging over
the cluster numbers, and (ii) the scores and gains by averag- 
ing over all cluster numbers and datasets (for short, overall 
average score). Due to the unavailability of reference clas- 
sifications for such datasets, we performed multiple tests by 
varying the number of clusters and assessed the results based 
on the internal criterion Q only. 

UCPC achieved the best overall average performance, with
maximum, average and minimum gains (over all the competing
algorithms) of 0.534 (w.r.t. FDBSCAN), 0.294, and
0.034 (w.r.t. MMVar), respectively. Moreover, in terms of
average scores, UCPC was the best method on Leukaemia,
while outperforming all methods but UAHC on Neuroblastoma;
however, the gap from UAHC remained small (0.054).
Again, our UCPC generally outperformed UK-means and
MMVar. Indeed, UCPC was on average more accurate than
UK-means on both datasets, while achieving better results
on 14 out of 16 dataset-by-number of clusters configurations,
whereas, compared to MMVar, UCPC achieved better results
on all 16 dataset-by-number of clusters configurations.

Figure 4: Efficiency results (panel (c): Real datasets)

5.2.2 Efficiency 

We also evaluated time performance of our UCPC on
both benchmark and real datasets.³ As previously mentioned,
in this evaluation we also included the basic UK-means,
MinMax-BB and VDBiP. In order to possibly strengthen
the pruning power of MinMax-BB and VDBiP,
they were both coupled with the cluster-shift technique,
since it has been demonstrated to have beneficial pruning
effects [11, 17]. It is worth emphasizing that the pruning
times (i.e., times spent to build and maintain data structures
needed for pruning) were discarded from our evaluation,
since we chose to focus primarily on the clustering time
performance; this also allowed us not to take into consideration
R-tree index variants of the Voronoi-diagrams-based
pruning (e.g., the RBi algorithm [11]), since the R-tree mainly
serves to reduce the pruning time. For an analogous reason,
we did not consider the time spent for the off-line stages
(i.e., distance pre-computation) required by UK-means and
UK-medoids as well. In this respect, we remark that our
UCPC does not require any off-line phase.

Figure 5: Scalability on the KDD Cup '99 dataset

Figure 4 reports the clustering runtimes (in milliseconds) 
obtained by the various methods on benchmark and real 
datasets. Note that, due to space limits of this paper, we 
present results only for the two largest among the bench- 
mark datasets (excluding KDDCup99); nevertheless, the per- 
formance trends we observed on the remaining datasets were 
roughly similar to those of the datasets here reported. In the 
figures, results on each dataset are organized in two plots: 
the left-hand plot contains results obtained by the "slower" 
algorithms, i.e., UK-medoids, basic UK-means, UAHC, and 
the density-based algorithms, whereas the right-hand plot 
contains results obtained by the "faster" MMVar, UK-means 
and the pruning methods; moreover, to facilitate the com- 
parison with the competing algorithms, each plot also re- 
ports the performance of our UCPC. 

Looking at the left-hand plots in Figure 4, we observe
that UCPC consistently outperformed all the competing
algorithms, more specifically: basic UK-means (1 order of
magnitude on benchmark datasets, and 2 orders on real
datasets), UAHC (3-5 orders), UK-medoids (3-4 orders),
FOPTICS (2-3 orders), and FDBSCAN (1-3 orders of
magnitude). Concerning the comparison of the "faster"
algorithms, UCPC performed very closely to UK-means and
MMVar, thus confirming a major claim of this work. Indeed,
UCPC achieved times always of the same order of magnitude
as UK-means and MMVar. The difference among these three
algorithms was very small and negligible in practice. It was
also interesting to observe that in most cases, the pruning-based
UK-means algorithms (i.e., MinMax-BB and VDBiP)
behaved very similarly to each other, and they were always
slower than UK-means and MMVar. Note that the worse
performance of MinMax-BB and VDBiP w.r.t. UK-means
is justified since UK-means does not perform any expected
distance computation in the on-line phase. Conversely, the
pruning methods significantly improved over the basic UK-means
(1 order of magnitude). Moreover, UCPC performed
generally better than MinMax-BB and VDBiP and, in some
cases, the gap was quite evident (e.g., 1 order of magnitude
on the real datasets).

Scalability. We carried out a scalability study using the
KDD Cup '99 dataset.⁴ Figure 5 summarizes the results
of this study, for which we varied the dataset size from 5%
to 100% and focused on UCPC and the fastest competing
algorithms. For each selected subset of the collection, we
ensured that all 23 classes were covered by the objects within
the subset. Thus, the number of clusters was conveniently
fixed to 23 for all the algorithms under consideration.

³ Experiments were conducted on a quad-core platform Intel Pentium IV 3GHz with 4GB memory and running Microsoft WinXP Pro.
⁴ For this study, we used the CRESCO HPC system (www.cresco.enea.it), which is integrated into the ENEA-GRID infrastructure. CRESCO is a general purpose system composed of 382 nodes with more than 3300 cores. We executed our experiments on a CentOS 5.5 platform, with Linux 2.6.18 kernel, 64GB memory, and 4 Intel(R) Xeon(R) E7330 2.40GHz quad-core CPUs.

As expected, all algorithms (including our UCPC) exhib- 
ited linear trends. Particularly, MMVar scaled better than 
the other algorithms, and UCPC behaved very similarly 
to UK-means. It was also interesting to observe that the 
pruning-based UK-means algorithms were subject to some 
fluctuations, which should be ascribed to a different effect 
of the pruning on the various dataset portions. 

6. CONCLUSION 

In this paper we defined a novel, well-founded notion of 
uncertain centroid for clusters of uncertain objects. The 
proposed notion differs from existing ones in that it repre- 
sents an uncertain object with a clear stochastic meaning, 
which conceptually refers to possible deterministic represen- 
tations of the objects being clustered. Based on this notion, 
we developed a formulation that overcomes the limitations 
of existing cluster compactness criteria by taking into ac- 
count the sum of expected values as well as the sum of the 
variances of the individual objects in the cluster. Experi- 
ments on synthetic and real data have supported our claims 
of efficiency and, more importantly, improved accuracy in 
clustering uncertain objects. 

APPENDIX

Lemma 1. Given a cluster $C$ of $m$-dimensional uncertain
objects, where $o = (\mathcal{R}, f)$, $\forall o \in C$, the function $J_{UK}$ defined
in (9) is equal to:

\[ J_{UK}(C) = \sum_{j=1}^{m} \Big[ \sum_{o \in C} (\mu_2)_j(o) - \frac{1}{|C|} \Big( \sum_{o \in C} \mu_j(o) \Big)^2 \Big] \]

□

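Proposition 1. Given any two clusters $C$ and $C'$ of
uncertain objects, it holds that:

\[ J_{UK}(C) = J_{UK}(C') \;\not\Rightarrow\; \sum_{o \in C} \sigma^2(o) = \sum_{o' \in C'} \sigma^2(o') \]

Proof Sketch. It is sufficient to find a case where
$J_{UK}(C) = J_{UK}(C')$ holds and $\sum_{o \in C} \sigma^2(o) = \sum_{o' \in C'} \sigma^2(o')$
does not. To this end, let us assume that: 1) $|C| = |C'|$,
2) $\sum_{j=1}^{m} \sum_{o \in C} (\mu_2)_j(o) = \sum_{j=1}^{m} \sum_{o' \in C'} (\mu_2)_j(o')$,
3) $\sum_{o \in C} \mu_j(o) = \sum_{o' \in C'} \mu_j(o')$, $\forall j \in [1..m]$, and
4) $\sum_{j=1}^{m} \sum_{o \in C} \mu_j^2(o) \neq \sum_{j=1}^{m} \sum_{o' \in C'} \mu_j^2(o')$. Considering
assumptions 1)-3) and according to Lemma 1, it follows that:

\[
\begin{aligned}
& \sum_{o \in C} \mu_j(o) = \sum_{o' \in C'} \mu_j(o') \;\Rightarrow\; \Big( \sum_{o \in C} \mu_j(o) \Big)^2 = \Big( \sum_{o' \in C'} \mu_j(o') \Big)^2, \quad \forall j \in [1..m] \\
& \Rightarrow\; \sum_{j=1}^{m} \Big( \sum_{o \in C} \mu_j(o) \Big)^2 = \sum_{j=1}^{m} \Big( \sum_{o' \in C'} \mu_j(o') \Big)^2 \\
& \Rightarrow\; \sum_{j=1}^{m} \sum_{o \in C} (\mu_2)_j(o) - \frac{1}{|C|} \sum_{j=1}^{m} \Big( \sum_{o \in C} \mu_j(o) \Big)^2 = \sum_{j=1}^{m} \sum_{o' \in C'} (\mu_2)_j(o') - \frac{1}{|C'|} \sum_{j=1}^{m} \Big( \sum_{o' \in C'} \mu_j(o') \Big)^2 \\
& \Rightarrow\; J_{UK}(C) = J_{UK}(C')
\end{aligned}
\]

Similarly, considering also assumption 4), it can be further
derived that:

\[ \sum_{j=1}^{m} \sum_{o \in C} \sigma_j^2(o) \neq \sum_{j=1}^{m} \sum_{o' \in C'} \sigma_j^2(o') \;\Rightarrow\; \sum_{o \in C} \sigma^2(o) \neq \sum_{o' \in C'} \sigma^2(o') \]

□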

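Lemma 2. Let $C$ be a cluster of uncertain objects, where
$o = (\mathcal{R}, f)$, $\forall o \in C$, and $C_{MM}$ be the centroid of $C$ employed
by MMVar. It holds that:

\[ \mu(C_{MM}) = \frac{1}{|C|} \sum_{o \in C} \mu(o), \qquad \mu_2(C_{MM}) = \frac{1}{|C|} \sum_{o \in C} \mu_2(o) \]

□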



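Proposition 2. Let $C$ be a cluster of $m$-dimensional
uncertain objects, where $o = (\mathcal{R}, f)$, $\forall o \in C$. In reference to
the functions $J_{UK}$ and $J_{MM}$ defined in (9) and (11),
respectively, it holds that $J_{MM}(C) = |C|^{-1} J_{UK}(C)$.

Proof Sketch. According to (4) and (6), $J_{MM}(C) = \sigma^2(C_{MM})$
can be rewritten as $J_{MM}(C) = \sum_{j=1}^{m} \big( (\mu_2)_j(C_{MM}) - \mu_j^2(C_{MM}) \big)$,
which, resorting to Lemmas 2 and 1, becomes:

\[ J_{MM}(C) = \sum_{j=1}^{m} \big( (\mu_2)_j(C_{MM}) - \mu_j^2(C_{MM}) \big) = \sum_{j=1}^{m} \Big[ \frac{1}{|C|} \sum_{o \in C} (\mu_2)_j(o) - \frac{1}{|C|^2} \Big( \sum_{o \in C} \mu_j(o) \Big)^2 \Big] = \frac{1}{|C|} J_{UK}(C) \]

□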

Lemma 3. The squared expected distance $ED(o, o')$
between any two $m$-dimensional uncertain objects $o = (\mathcal{R}, f)$
and $o' = (\mathcal{R}', f')$ is equal to
$\sum_{j=1}^{m} \big( (\mu_2)_j(o) - 2\,\mu_j(o)\,\mu_j(o') + (\mu_2)_j(o') \big)$. □
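Lemma 3 can be checked numerically on a toy case. The sketch below assumes, purely for illustration, that each uncertain object is a finite list of equally likely points and that the two objects are independent; under those assumptions the expected squared Euclidean distance can be computed by brute force over all pairs of realizations and compared with the moment-based expression. Function names are hypothetical.

```python
from itertools import product
from typing import List, Sequence, Tuple

Point = Sequence[float]


def moments(points: List[Point]) -> Tuple[List[float], List[float]]:
    """Per-dimension expected value and second-order moment."""
    n, m = len(points), len(points[0])
    mu = [sum(p[j] for p in points) / n for j in range(m)]
    mu2 = [sum(p[j] ** 2 for p in points) / n for j in range(m)]
    return mu, mu2


def ed_closed_form(a: List[Point], b: List[Point]) -> float:
    """Moment-based expression of Lemma 3."""
    mu_a, mu2_a = moments(a)
    mu_b, mu2_b = moments(b)
    return sum(
        mu2_a[j] - 2 * mu_a[j] * mu_b[j] + mu2_b[j] for j in range(len(mu_a))
    )


def ed_brute_force(a: List[Point], b: List[Point]) -> float:
    """Average squared Euclidean distance over all pairs of realizations."""
    total = sum(
        sum((x - y) ** 2 for x, y in zip(p, q)) for p, q in product(a, b)
    )
    return total / (len(a) * len(b))
```

For instance, with realizations {(0, 0), (2, 0)} and {(1, 1), (1, 3)}, both functions give 6.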



Proposition 3. Let $C$ be a cluster of $m$-dimensional
uncertain objects, where $o = (\mathcal{R}, f)$, $\forall o \in C$. In reference to
the functions $J_{UK}$, $J_{MM}$, and $J$ defined in (9), (11), and
(12) respectively, it holds that $J(C) = 2\,|C|\,J_{MM}(C) = 2\,J_{UK}(C)$.

Proof Sketch. According to Lemma 3, function $J$
reported in (12) can be rewritten as:

\[
\begin{aligned}
J(C) &= \sum_{j=1}^{m} \sum_{o \in C} \big( (\mu_2)_j(o) - 2\,\mu_j(o)\,\mu_j(C_{MM}) + (\mu_2)_j(C_{MM}) \big) \\
&= \sum_{j=1}^{m} \Big( \sum_{o \in C} (\mu_2)_j(o) - 2\,\mu_j(C_{MM}) \sum_{o \in C} \mu_j(o) + |C|\,(\mu_2)_j(C_{MM}) \Big)
\end{aligned}
\]

Since $\sum_{o \in C} \mu_j(o) = |C|\,\mu_j(C_{MM})$ and $\sum_{o \in C} (\mu_2)_j(o) = |C| \times (\mu_2)_j(C_{MM})$
according to Lemma 2, and $\sigma_j^2(o) = (\mu_2)_j(o) - \mu_j^2(o)$
according to (5), the latter expression becomes:

\[ J(C) = 2\,|C| \sum_{j=1}^{m} \big( (\mu_2)_j(C_{MM}) - \mu_j^2(C_{MM}) \big) = 2\,|C| \sum_{j=1}^{m} \sigma_j^2(C_{MM}) \]

By resorting to (6), (11), and Proposition 2, we have finally
that:

\[ J(C) = 2\,|C|\,\sigma^2(C_{MM}) = 2\,|C|\,J_{MM}(C) = 2\,J_{UK}(C). \]

□



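Theorem 1. Given a cluster $C = \{o_1, \ldots, o_{|C|}\}$ of
$m$-dimensional uncertain objects, where $o_i = (\mathcal{R}_i, f_i)$ and
$\mathcal{R}_i = [\ell_i^{(1)}, u_i^{(1)}] \times \cdots \times [\ell_i^{(m)}, u_i^{(m)}]$, $\forall i \in [1..|C|]$, let $\bar{C} = (\bar{\mathcal{R}}, \bar{f})$ be the
U-centroid of $C$ defined by employing the squared Euclidean
norm as distance to be minimized. It holds that:

\[ \bar{f}(\vec{x}) = \int_{\vec{x}_1 \in \mathcal{R}_1} \cdots \int_{\vec{x}_{|C|} \in \mathcal{R}_{|C|}} \mathbb{I}\Big[ \vec{x} = \frac{1}{|C|} \sum_{i=1}^{|C|} \vec{x}_i \Big] \prod_{i=1}^{|C|} f_i(\vec{x}_i) \, d\vec{x}_1 \cdots d\vec{x}_{|C|} \]

\[ \bar{\mathcal{R}} = \Big[ \frac{1}{|C|} \sum_{i=1}^{|C|} \ell_i^{(1)}, \frac{1}{|C|} \sum_{i=1}^{|C|} u_i^{(1)} \Big] \times \cdots \times \Big[ \frac{1}{|C|} \sum_{i=1}^{|C|} \ell_i^{(m)}, \frac{1}{|C|} \sum_{i=1}^{|C|} u_i^{(m)} \Big] \]

where $\mathbb{I}[A]$ is the indicator function, which is 1 when the
event $A$ occurs, 0 otherwise.

Proof Sketch. Let us consider sets $\mathcal{S} = \{\{\vec{x}_1, \ldots, \vec{x}_{|C|}\} \mid \vec{x}_1 \in \mathcal{R}_1 \wedge \cdots \wedge \vec{x}_{|C|} \in \mathcal{R}_{|C|}\}$
and $\mathcal{S}_{\vec{x}} = \{S \mid S \in \mathcal{S} \wedge \vec{x} = \arg\min_{\vec{y} \in \Re^m} \sum_{\vec{x}' \in S} d(\vec{y}, \vec{x}')\}$.
As $\mathcal{S}$ represents a probability
space, it can be exploited for defining a random variable
$X_{\mathcal{S}}$, whose realizations $S \in \mathcal{S}$ describe the events that
the actual representations of the objects $o_1, \ldots, o_{|C|} \in C$
correspond to the points $\vec{x}_1, \ldots, \vec{x}_{|C|} \in S$, respectively. $X_{\mathcal{S}}$
can be exploited as a conditional variable to derive:

\[ f_{X_{\bar{C}}}(\vec{x}) = \bar{f}(\vec{x}) = \int_{S \in \mathcal{S}} f_{X_{\bar{C}} \mid X_{\mathcal{S}} = S}(\vec{x} \mid S) \, f_{X_{\mathcal{S}}}(S) \, dS \]

where $f_{X_{\mathcal{S}}}(S)$ is the pdf of $X_{\mathcal{S}}$. As $f_{X_{\bar{C}} \mid X_{\mathcal{S}} = S}(\vec{x} \mid S) = 1$ if
$S \in \mathcal{S}_{\vec{x}}$, 0 otherwise, as well as
$f_{X_{\mathcal{S}}}(S) = \Pr(o_1 = \vec{x}_1 \wedge \cdots \wedge o_{|C|} = \vec{x}_{|C|}) = \prod_{i=1}^{|C|} \Pr(o_i = \vec{x}_i) = \prod_{i=1}^{|C|} f_i(\vec{x}_i)$, $\forall S = \{\vec{x}_1, \ldots, \vec{x}_{|C|}\} \in \mathcal{S}$,
it follows that:

\[ \bar{f}(\vec{x}) = \int_{S \in \mathcal{S}} f_{X_{\bar{C}} \mid X_{\mathcal{S}} = S}(\vec{x} \mid S) \, f_{X_{\mathcal{S}}}(S) \, dS = \int_{\vec{x}_1 \in \mathcal{R}_1} \cdots \int_{\vec{x}_{|C|} \in \mathcal{R}_{|C|}} \mathbb{I}\Big[ \vec{x} = \arg\min_{\vec{y} \in \Re^m} \sum_{i=1}^{|C|} d(\vec{y}, \vec{x}_i) \Big] \prod_{i=1}^{|C|} f_i(\vec{x}_i) \, d\vec{x}_1 \cdots d\vec{x}_{|C|} \]

By employing the squared Euclidean norm as distance $d$, the
optimization problem $\vec{x} = \arg\min_{\vec{y} \in \Re^m} \sum_{i=1}^{|C|} d(\vec{y}, \vec{x}_i)$ becomes
$\vec{x} = \arg\min_{\vec{y} \in \Re^m} g(\vec{y})$, where $g(\vec{y}) = \sum_{i=1}^{|C|} \|\vec{y} - \vec{x}_i\|^2$.
The solution of such a problem is $\vec{y} = |C|^{-1} \sum_{i=1}^{|C|} \vec{x}_i$. By
replacing this into the expression of $\bar{f}(\vec{x})$, we obtain:

\[
\begin{aligned}
\bar{f}(\vec{x}) &= \int_{\vec{x}_1 \in \mathcal{R}_1} \cdots \int_{\vec{x}_{|C|} \in \mathcal{R}_{|C|}} \mathbb{I}\Big[ \vec{x} = \arg\min_{\vec{y} \in \Re^m} \sum_{i=1}^{|C|} d(\vec{y}, \vec{x}_i) \Big] \prod_{i=1}^{|C|} f_i(\vec{x}_i) \, d\vec{x}_1 \cdots d\vec{x}_{|C|} \\
&= \int_{\vec{x}_1 \in \mathcal{R}_1} \cdots \int_{\vec{x}_{|C|} \in \mathcal{R}_{|C|}} \mathbb{I}\Big[ \vec{x} = \frac{1}{|C|} \sum_{i=1}^{|C|} \vec{x}_i \Big] \prod_{i=1}^{|C|} f_i(\vec{x}_i) \, d\vec{x}_1 \cdots d\vec{x}_{|C|}
\end{aligned}
\]

which demonstrates the first statement of the theorem.

The second statement can be proved by observing that
all possible representations of the U-centroid $\bar{C}$ can only
result from averaging over the possible representations of the
uncertain objects within the cluster $C$. □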

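Lemma 4. Let $C = \{o_1, \ldots, o_{|C|}\}$ be a cluster of
uncertain objects, where $o_i = (\mathcal{R}_i, f_i)$, $\forall i \in [1..|C|]$, and $\bar{C} =
(\bar{\mathcal{R}}, \bar{f})$ be the U-centroid of $C$ defined according to
Theorem 1. Given any function $g : \bar{\mathcal{R}} \to \Re$, it holds that:

\[ \int_{\vec{x} \in \bar{\mathcal{R}}} g(\vec{x}) \, \bar{f}(\vec{x}) \, d\vec{x} = \int_{\vec{x}_1 \in \mathcal{R}_1} \cdots \int_{\vec{x}_{|C|} \in \mathcal{R}_{|C|}} g\Big( \frac{1}{|C|} \sum_{i=1}^{|C|} \vec{x}_i \Big) \prod_{i=1}^{|C|} f_i(\vec{x}_i) \, d\vec{x}_1 \cdots d\vec{x}_{|C|} \]

□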

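Lemma 5. Let $C = \{o_1, \ldots, o_{|C|}\}$ be a cluster of
uncertain objects, where $o_i = (\mathcal{R}_i, f_i)$, $\forall i \in [1..|C|]$, and $\bar{C} =
(\bar{\mathcal{R}}, \bar{f})$ be the U-centroid of $C$ defined according to
Theorem 1. It holds that $\mu(\bar{C}) = |C|^{-1} \sum_{i=1}^{|C|} \mu(o_i)$, and

\[ \mu_2(\bar{C}) = \frac{1}{|C|^2} \Big( \sum_{i=1}^{|C|} \mu_2(o_i) + 2 \sum_{i=1}^{|C|-1} \mu(o_i) \sum_{i'=i+1}^{|C|} \mu(o_{i'}) \Big) \]

□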



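Theorem 2. Given a cluster $C = \{o_1, \ldots, o_{|C|}\}$ of
$m$-dimensional uncertain objects, where $o_i = (\mathcal{R}_i, f_i)$, $\forall i \in
[1..|C|]$, let $\bar{C} = (\bar{\mathcal{R}}, \bar{f})$ be the U-centroid of $C$ defined
according to Theorem 1. It holds that $\sigma^2(\bar{C}) = |C|^{-2} \sum_{i=1}^{|C|} \sigma^2(o_i)$.

Proof Sketch. From (5) and (6), it follows that
$\sigma^2(\bar{C}) = \sum_{j=1}^{m} \big( (\mu_2)_j(\bar{C}) - \mu_j^2(\bar{C}) \big)$, which, exploiting $\mu(\bar{C})$
and $\mu_2(\bar{C})$ derived in Lemma 5, can be rewritten as:

\[ \sigma^2(\bar{C}) = \frac{1}{|C|^2} \sum_{j=1}^{m} \sum_{i=1}^{|C|} \big( (\mu_2)_j(o_i) - \mu_j^2(o_i) \big) = \frac{1}{|C|^2} \sum_{i=1}^{|C|} \sigma^2(o_i) \]

□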
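Theorem 2 lends itself to a quick numerical check in one dimension. The sketch assumes, for illustration only, that each uncertain object is a finite list of equally likely values and that objects are independent, so that the realizations of the U-centroid are the averages of all combinations of object realizations (in line with Theorem 1). Function names are hypothetical.

```python
from itertools import product
from typing import List


def variance(values: List[float]) -> float:
    """Variance of a discrete object whose values are equally likely."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)


def centroid_realizations(cluster: List[List[float]]) -> List[float]:
    """All (equally likely) realizations of the U-centroid of the cluster."""
    return [sum(combo) / len(cluster) for combo in product(*cluster)]


cluster = [[0.0, 2.0], [1.0, 5.0], [3.0, 3.0, 6.0]]
lhs = variance(centroid_realizations(cluster))              # sigma^2 of the U-centroid
rhs = sum(variance(o) for o in cluster) / len(cluster) ** 2  # |C|^-2 * sum of sigma^2
```

In this example the individual variances are 1, 4 and 2, so both sides evaluate to 7/9 up to floating-point error.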



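Theorem 3. Let $C = \{o_1, \ldots, o_{|C|}\}$ be a cluster of
$m$-dimensional uncertain objects, where $o_i = (\mathcal{R}_i, f_i)$, $\forall i \in
[1..|C|]$, and $\bar{C} = (\bar{\mathcal{R}}, \bar{f})$ be the U-centroid of $C$ defined
according to Theorem 1. In reference to the function $J$ defined
in (14), it holds that:

\[ J(C) = \sum_{j=1}^{m} \Big( \frac{\Psi_C^{(j)}}{|C|} + \Phi_C^{(j)} - \frac{\Upsilon_C^{(j)}}{|C|} \Big) = \frac{1}{|C|} \sum_{i=1}^{|C|} \sigma^2(o_i) + J_{UK}(C) \]

where $J_{UK}$ is the UK-means objective function (cf. (9)) and

\[ \Psi_C^{(j)} = \sum_{i=1}^{|C|} \sigma_j^2(o_i), \qquad \Phi_C^{(j)} = \sum_{i=1}^{|C|} (\mu_2)_j(o_i), \qquad \Upsilon_C^{(j)} = \Big( \sum_{i=1}^{|C|} \mu_j(o_i) \Big)^2 \]

Proof Sketch. According to Lemma 3, it holds that:

\[ J(C) = \sum_{j=1}^{m} \Big( \sum_{i=1}^{|C|} (\mu_2)_j(o_i) - 2\,\mu_j(\bar{C}) \sum_{i=1}^{|C|} \mu_j(o_i) + |C|\,(\mu_2)_j(\bar{C}) \Big) \]

Also, according to Lemma 5, it results that $\mu_j(\bar{C}) = |C|^{-1} \sum_{i=1}^{|C|} \mu_j(o_i)$, and

\[ (\mu_2)_j(\bar{C}) = \frac{1}{|C|^2} \Big( \sum_{i=1}^{|C|} (\mu_2)_j(o_i) + \Big( \sum_{i=1}^{|C|} \mu_j(o_i) \Big)^2 - \sum_{i=1}^{|C|} \mu_j^2(o_i) \Big) \]

Thus, substituting such expressions of $\mu_j(\bar{C})$ and $(\mu_2)_j(\bar{C})$
into function $J$, we obtain:

\[
\begin{aligned}
J(C) &= \sum_{j=1}^{m} \Big( \sum_{i=1}^{|C|} (\mu_2)_j(o_i) - \frac{2}{|C|} \Big( \sum_{i=1}^{|C|} \mu_j(o_i) \Big)^2 + \frac{1}{|C|} \Big( \sum_{i=1}^{|C|} (\mu_2)_j(o_i) + \Big( \sum_{i=1}^{|C|} \mu_j(o_i) \Big)^2 - \sum_{i=1}^{|C|} \mu_j^2(o_i) \Big) \Big) \\
&= \sum_{j=1}^{m} \Big( \frac{\Psi_C^{(j)}}{|C|} + \Phi_C^{(j)} - \frac{\Upsilon_C^{(j)}}{|C|} \Big)
\end{aligned}
\]

which proves the first part of the theorem. The second part
can be derived by applying the results from Lemma 2:

\[
\begin{aligned}
J(C) &= \frac{1}{|C|} \sum_{i=1}^{|C|} \sigma^2(o_i) + |C| \sum_{j=1}^{m} \Big( \frac{1}{|C|} \sum_{i=1}^{|C|} (\mu_2)_j(o_i) - \Big( \frac{1}{|C|} \sum_{i=1}^{|C|} \mu_j(o_i) \Big)^2 \Big) \\
&= \frac{1}{|C|} \sum_{i=1}^{|C|} \sigma^2(o_i) + |C|\,\sigma^2(C_{MM}) = \frac{1}{|C|} \sum_{i=1}^{|C|} \sigma^2(o_i) + J_{UK}(C)
\end{aligned}
\]

□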
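The equality stated in Theorem 3 can be checked on moments alone. The sketch below, with hypothetical function names and a hypothetical `(mu, mu2)` representation of each object, computes $J(C)$ once as the sum of the Lemma 3 distances between each object and the U-centroid (whose moments are taken per dimension from Lemma 5 and Theorem 2) and once through the second closed form, with $J_{UK}$ expanded as in Lemma 1.

```python
from typing import List, Sequence, Tuple

Moments = Tuple[Sequence[float], Sequence[float]]  # (mu, mu2) per dimension


def j_direct(cluster: List[Moments]) -> float:
    """Sum over objects of the Lemma 3 distance to the U-centroid."""
    n, m = len(cluster), len(cluster[0][0])
    total = 0.0
    for j in range(m):
        mu_c = sum(mu[j] for mu, _ in cluster) / n
        var_c = sum(mu2[j] - mu[j] ** 2 for mu, mu2 in cluster) / n ** 2
        mu2_c = var_c + mu_c ** 2  # second moment of the U-centroid
        total += sum(mu2[j] - 2 * mu[j] * mu_c + mu2_c for mu, mu2 in cluster)
    return total


def j_closed_form(cluster: List[Moments]) -> float:
    """(1/|C|) * sum of variances + J_UK(C), with J_UK as in Lemma 1."""
    n, m = len(cluster), len(cluster[0][0])
    var_sum = sum(mu2[j] - mu[j] ** 2 for mu, mu2 in cluster for j in range(m))
    j_uk = sum(
        sum(mu2[j] for _, mu2 in cluster)
        - sum(mu[j] for mu, _ in cluster) ** 2 / n
        for j in range(m)
    )
    return var_sum / n + j_uk
```

The two functions agree up to floating-point error on any cluster of valid moments, that is, whenever each second-order moment is at least the square of the corresponding expected value.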



Proposition 4. The UCPC algorithm outlined in Alg. 1
converges to a local minimum of function $\sum_{C \in \mathcal{C}} J(C)$ in a
finite number of steps.

Proof Sketch. Let us denote by $V^{(h)}$ the value $\sum_{C \in \mathcal{C}^{(h)}} J(C)$,
where $\mathcal{C}^{(h)}$ is the clustering computed at the $h$-th
iteration of UCPC. To prove the proposition, it is sufficient
to show that $V^{(h)} < V^{(h-1)}$ at each iteration $h > 1$, as the
function $\sum_{C \in \mathcal{C}} J(C)$ is bounded below. This is true as at
each step of the algorithm the optimal move of objects to
clusters is performed. □

Proposition 5. Given a set $\mathcal{D}$ of $n$ $m$-dimensional
uncertain objects, the number $k$ of output clusters, and
denoting by $I$ the number of iterations to convergence, the
computational complexity of the UCPC algorithm (Alg. 1)
is $\mathcal{O}(I\,k\,n\,m)$.

Proof Sketch. The initialization (offline) phase (Lines
1-3) takes $\mathcal{O}(k\,n\,m)$, as does each iteration of the main cycle
(Lines 4-16), thanks to the formulas derived in Corollary 1; this leads to
an overall complexity of $\mathcal{O}(I\,k\,n\,m)$. □
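One plausible way to reach this per-iteration cost, sketched here as an assumption rather than as the formulas of Corollary 1, is to keep per-cluster sufficient statistics for the quantities appearing in Theorem 3. Adding or removing an object then takes time linear in $m$, and so does evaluating $J(C)$, so that testing the relocation of each of the $n$ objects against each of the $k$ clusters costs on the order of $k\,n\,m$ operations. The class and method names are hypothetical.

```python
from typing import List, Sequence


class ClusterStats:
    """Running per-dimension sums needed to evaluate J(C) in O(m)."""

    def __init__(self, m: int) -> None:
        self.n = 0
        self.s: List[float] = [0.0] * m    # sum of expected values
        self.phi: List[float] = [0.0] * m  # sum of second-order moments
        self.psi: List[float] = [0.0] * m  # sum of variances

    def update(self, mu: Sequence[float], mu2: Sequence[float],
               sign: int = 1) -> None:
        """Add (sign=+1) or remove (sign=-1) one object in O(m)."""
        self.n += sign
        for j in range(len(self.s)):
            self.s[j] += sign * mu[j]
            self.phi[j] += sign * mu2[j]
            self.psi[j] += sign * (mu2[j] - mu[j] ** 2)

    def cost(self) -> float:
        """J(C) = sum_j (Psi_j/|C| + Phi_j - Upsilon_j/|C|), in O(m)."""
        if self.n == 0:
            return 0.0
        return sum(
            self.psi[j] / self.n + self.phi[j] - self.s[j] ** 2 / self.n
            for j in range(len(self.s))
        )
```

The change in the objective caused by moving an object from one cluster to another can then be obtained from four `cost()` evaluations (source and target, before and after the move) without revisiting the other members of either cluster.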